Data Engineering Podcast

Tobias Macey

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Episodes

Jan 27, 2020 • 47min

Pay Down Technical Debt In Your Data Pipeline With Great Expectations

Summary
Data pipelines are complicated and business-critical pieces of technical infrastructure. Unfortunately they are also difficult to test, leading to a significant amount of technical debt which contributes to slower iteration cycles. In this episode James Campbell describes how he helped create the Great Expectations framework to help you gain control and confidence in your data delivery workflows, the challenges of validating and monitoring the quality and accuracy of your data, and how you can use it in your own environments to improve your ability to move fast.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing James Campbell about Great Expectations, the open source test framework for your data pipelines which helps you continually monitor and validate the integrity and quality of your data.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Great Expectations is and the origin of the project?
What has changed in the implementation and focus of Great Expectations since we last spoke on Podcast.__init__ 2 years ago?
Prior to your introduction of Great Expectations what was the state of the industry with regards to testing, monitoring, or validation of the health and quality of data and the platforms operating on them?
What are some of the types of checks and assertions that can be made about a pipeline using Great Expectations?
What are some of the non-obvious use cases for Great Expectations?
What aspects of a data pipeline or the context that it operates in are unable to be tested or validated in a programmatic fashion?
Can you describe how Great Expectations is implemented?
For anyone interested in using Great Expectations, what is the workflow for incorporating it into their environments?
What are some of the test cases that are often overlooked which data engineers and pipeline operators should be considering?
Can you talk through some of the ways that Great Expectations can be extended?
What are some notable extensions or integrations of Great Expectations?
Beyond the testing and validation of data as it is being processed you have also included features that support documentation and collaboration of the data lifecycles. What are some of the ways that those features can benefit a team working with Great Expectations?
What are some of the most interesting/innovative/unexpected ways that you have seen Great Expectations used?
What are the limitations of Great Expectations?
What are some cases where Great Expectations would be the wrong choice?
What do you have planned for the future of Great Expectations?

Contact Info
LinkedIn
@jpcampbell42 on Twitter
jcampbell on GitHub

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Great Expectations (GitHub, Twitter)
Podcast.__init__ Interview on Great Expectations
Superconductive Health
Abe Gong
Pandas (Podcast.__init__ Interview)
SQLAlchemy
PostgreSQL (Podcast Episode)
Redshift
BigQuery
Spark
Cloudera
Databricks
Great Expectations Data Docs
Great Expectations Data Profiling
Apache NiFi
Amazon Deequ
TensorFlow Data Validation

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Jan 20, 2020 • 39min

Replatforming Production Dataflows

Summary
Building a reliable data platform is a never-ending task. Even if you have a process that works for you and your business there can be unexpected events that require a change in your platform architecture. In this episode the head of data for Mayvenn shares their experience migrating an existing set of streaming workflows onto the Ascend platform after their previous vendor was acquired and changed their offering. This is an interesting discussion about the ongoing maintenance and decision making required to keep your business data up to date and accurate.

Announcements
Your host is Tobias Macey and today I’m interviewing Sheel Choksi and Sean Knapp about Mayvenn’s experience migrating their dataflows onto the Ascend platform.

Interview
Introduction
How did you get involved in the area of data management?
Can you start off by describing what Mayvenn is and give a sense of how you are using data?
What are the sources of data that you are working with?
What are the biggest challenges you are facing in collecting, processing, and analyzing your data?
Before adopting Ascend, what did your overall platform for data management look like?
What were the pain points that you were facing which led you to seek a new solution?
What were the selection criteria that you set forth for addressing your needs at the time?
What were the aspects of Ascend which were most appealing?
What are some of the edge cases that you have dealt with in the Ascend platform?
Now that you have been using Ascend for a while, what components of your previous architecture have you been able to retire?
Can you talk through the migration process of incorporating Ascend into your platform and any validation that you used to ensure that your data operations remained accurate and consistent?
How has the migration to Ascend impacted your overall capacity for processing data or integrating new sources into your analytics?
What are your future plans for how to use data across your organization?

Contact Info
Sheel: LinkedIn, sheelc on GitHub
Sean: LinkedIn, @seanknapp on Twitter

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Mayvenn
Ascend (Podcast Episode)
Google Sawzall
Clickstream
Apache Kafka
Alooma (Podcast Episode)
Amazon Redshift
ELT == Extract, Load, Transform
DBT (Podcast Episode)
Amazon Data Pipeline
Upsolver
Pentaho
Stitch Data
Fivetran (Podcast Episode)

Jan 13, 2020 • 1h 1min

Planet Scale SQL For The New Generation Of Applications With YugabyteDB

Summary
The modern era of software development is identified by ubiquitous access to elastic infrastructure for computation and easy automation of deployment. This has led to a class of applications that can quickly scale to serve users worldwide. This requires a new class of data storage which can accommodate that demand without having to rearchitect your system at each level of growth. YugabyteDB is an open source database designed to support planet scale workloads with high data density and full ACID compliance. In this episode Karthik Ranganathan explains how Yugabyte is architected, their motivations for being fully open source, and how they simplify the process of scaling your application from greenfield to global.

Announcements
Your host is Tobias Macey and today I’m interviewing Karthik Ranganathan about YugabyteDB, the open source, high-performance distributed SQL database for global, internet-scale apps.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what YugabyteDB is and its origin story?
A growing trend in database engines (e.g. FaunaDB, CockroachDB) has been an out of the box focus on global distribution. Why is that important and how does it work in Yugabyte? What are the caveats?
What are the most notable features of YugabyteDB that would lead someone to choose it over any of the myriad other options?
What are the use cases that it is uniquely suited to?
What are some of the systems or architecture patterns that can be replaced with Yugabyte?
How does the design of Yugabyte or the different ways it is being used influence the way that users should think about modeling their data?
Yugabyte is an impressive piece of engineering. Can you talk through the major design elements and how it is implemented?
Easy scaling and failover is a feature that many database engines would like to be able to claim. What are the difficult elements that prevent them from implementing that capability as a standard practice? What do you have to sacrifice in order to support the level of scale and fault tolerance that you provide?
Speaking of scaling, there are many ways to define that term, from vertical scaling of storage or compute, to horizontal scaling of compute, to scaling of reads and writes. What are the primary scaling factors that you focus on in Yugabyte?
How do you approach testing and validation of the code given the complexity of the system that you are building?
In terms of the query API you have support for a Postgres compatible SQL dialect as well as a Cassandra based syntax. What are the benefits of targeting compatibility with those platforms? What are the challenges and benefits of maintaining compatibility with those other platforms?
Can you describe how the storage layer is implemented and the division between the different query formats?
What are the operational characteristics of YugabyteDB?
What are the complexities or edge cases that users should be aware of when planning a deployment?
One of the challenges of working with large volumes of data is creating and maintaining backups. How does Yugabyte handle that problem?
Most open source infrastructure projects that are backed by a business withhold various "enterprise" features such as backups and change data capture as a means of driving revenue. Can you talk through your motivation for releasing those capabilities as open source?
What is the business model that you are using for YugabyteDB and how does it differ from the tribal knowledge of how open source companies generally work?
What are some of the most interesting, innovative, or unexpected ways that you have seen Yugabyte used?
When is Yugabyte the wrong choice?
What do you have planned for the future of the technical and business aspects of Yugabyte?

Contact Info
@karthikr on Twitter
LinkedIn
rkarthik007 on GitHub

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
YugabyteDB (GitHub)
Nutanix
Facebook Engineering
Apache Cassandra
Apache HBase
Delphi
FaunaDB (Podcast Episode)
CockroachDB (Podcast Episode)
HA == High Availability
Oracle
Microsoft SQL Server
PostgreSQL (Podcast Episode)
MongoDB
Amazon Aurora
PGCrypto
PostGIS
pl/pgsql
Foreign Data Wrappers
PipelineDB (Podcast Episode)
Citus (Podcast Episode)
Jepsen Testing
Yugabyte Jepsen Test Results
OLTP == Online Transaction Processing
OLAP == Online Analytical Processing
DocDB
Google Spanner
Google BigTable
Spot Instances
Kubernetes
Cloudformation
Terraform
Prometheus
Debezium (Podcast Episode)

Jan 6, 2020 • 53min

Change Data Capture For All Of Your Databases With Debezium

Summary
Databases are useful for inspecting the current state of your application, but inspecting the history of that data can get messy without a way to track changes as they happen. Debezium is an open source platform for reliable change data capture that you can use to build supplemental systems for everything from maintaining audit trails to real-time updates of your data warehouse. In this episode Gunnar Morling and Randall Hauch explain why it got started, how it works, and some of the myriad ways that you can use it. If you have ever struggled with implementing your own change data capture pipeline, or understanding when it would be useful then this episode is for you.

Announcements
Your host is Tobias Macey and today I’m interviewing Randall Hauch and Gunnar Morling about Debezium, an open source distributed platform for change data capture.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Change Data Capture is and some of the ways that it can be used?
What is Debezium and what problems does it solve?
What was your motivation for creating it?
What are some of the use cases that it enables?
What are some of the other options on the market for handling change data capture?
Can you describe the systems architecture of Debezium and how it has evolved since it was first created?
How has the tight coupling with Kafka impacted the direction and capabilities of Debezium?
What, if any, other substrates does Debezium support (e.g. Pulsar, Bookkeeper, Pravega)?
What are the data sources that are supported by Debezium?
Given that you have branched into non-relational stores, how have you approached organization of the code to allow for handling the specifics of those engines while retaining a common core set of functionality?
What is involved in deploying, integrating, and maintaining an installation of Debezium?
What are the scaling factors?
What are some of the edge cases that users and operators should be aware of?
Debezium handles the ingestion and distribution of database changesets. What are the downstream challenges or complications that application designers or systems architects have to deal with to make use of that information?
What are some of the design tensions that exist in the Debezium community between acting as a simple pipe vs. adding functionality for interpreting/aggregating/formatting the information contained in the changesets?
What are some of the common downstream systems that consume the outputs of Debezium?
What challenges or complexities are involved in building clients that can consume the changesets from the different engines that you support?
What are some of the most interesting, unexpected, or innovative ways that you have seen Debezium used?
What have you found to be the most challenging, complex, or complicated aspects of building, maintaining, and growing Debezium?
What is in store for the future of Debezium?

Contact Info
Randall: LinkedIn, @rhauch on Twitter, rhauch on GitHub
Gunnar: gunnarmorling on GitHub, @gunnarmorling on Twitter

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Debezium
Confluent
Kafka Connect
RedHat
Bean Validation
Change Data Capture
DBMS == DataBase Management System
Apache Kafka
Apache Flink (Podcast Episode)
Yugabyte DB
PostgreSQL (Podcast Episode)
MySQL
Microsoft SQL Server
Apache Pulsar (Podcast Episode)
Pravega (Podcast Episode)
NATS
Amazon Kinesis
Pulsar IO
WePay

Dec 30, 2019 • 46min

Building The DataDog Platform For Processing Timeseries Data At Massive Scale

Summary
DataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to support their customers, they need to capture, process, and analyze massive amounts of timeseries data with a high degree of uptime and reliability. Vadim Semenov works on their data engineering team and joins the podcast in this episode to discuss the challenges that he works through, the systems that DataDog has built to power their business, and how their teams are organized to allow for rapid growth and massive scale. Getting an inside look at the companies behind the services we use is always useful, and this conversation was no exception.

Announcements
Your host is Tobias Macey and today I’m interviewing Vadim Semenov about how data engineers work at DataDog.

Interview
Introduction
How did you get involved in the area of data management?
For anyone who isn’t familiar with DataDog, can you start by describing the types and volumes of data that you’re dealing with?
What are the main components of your platform for managing that information?
How are the data teams at DataDog organized and what are your primary responsibilities in the organization?
What are some of the complexities and challenges that you face in your work as a result of the volume of data that you are processing?
What are some of the strategies which have proven to be most useful in overcoming those challenges?
Who are the main consumers of your work and how do you build in feedback cycles to ensure that their needs are being met?
Given that the majority of the data being ingested by DataDog is timeseries, what are your lifecycle and retention policies for that information?
Most of the data that you are working with is customer generated from your deployed agents and API integrations. How do you manage cleanliness and schema enforcement for the events as they are being delivered?
What are some of the upcoming projects that you have planned for the upcoming months and years?
What are some of the technologies, patterns, or practices that you are hoping to adopt?

Contact Info
LinkedIn
@databuryat on Twitter

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
DataDog
Hadoop
Hive
Yarn
Chef
SRE == Site Reliability Engineer
Application Performance Management (APM)
Apache Kafka
RocksDB
Cassandra
Apache Parquet data serialization format
SLA == Service Level Agreement
WatchDog
Apache Spark (Podcast Episode)
Apache Pig
Databricks
JVM == Java Virtual Machine
Kubernetes
SSIS (SQL Server Integration Services)
Pentaho
JasperSoft
Apache Airflow (Podcast.__init__ Episode)
Apache NiFi (Podcast Episode)
Luigi
Dagster (Podcast Episode)
Prefect

Dec 23, 2019 • 48min

Building The Materialize Engine For Interactive Streaming Analytics In SQL

Summary Transactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data. To address that need more completely the team at Materialize has created an engine that allows for building queryable views of your data as it is continually updated from the stream of changes being generated by your applications. In this episode Frank McSherry, chief scientist of Materialize, explains why it was created, what use cases it enables, and how it works to provide fast queries on continually updated data. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. 
We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Frank McSherry about Materialize, an engine for maintaining materialized views on incrementally updated data from change data captures Interview Introduction How did you get involved in the area of data management? Can you start by describing what Materialize is and the problems that you are aiming to solve with it? What was your motivation for creating it? What use cases does Materialize enable? What are some of the existing tools or systems that you have seen employed to address those needs which can be replaced by Materialize? How does it fit into the broader ecosystem of data tools and platforms? What are some of the use cases that Materialize is uniquely able to support? How is Materialize architected and how has the design evolved since you first began working on it? Materialize is based on your timely-dataflow project, which itself is based on the work you did on Naiad. What was your reasoning for using Rust as the implementation target and what benefits has it provided? What are some of the components or primitives that were missing in the Rust ecosystem as compared to what is available in Java or C/C++, which have been the dominant languages for distributed data systems? In the list of features, you highlight full support for ANSI SQL 92. What were some of the edge cases that you faced in complying with that standard given the distributed execution context for Materialize? A majority of SQL oriented platforms define custom extensions or built-in functions that are specific to their problem domain. 
What are some of the existing or planned additions for Materialize? Can you talk through the lifecycle of data as it flows from the source database and through the Materialize engine? What are the considerations and constraints on maintaining the full history of the source data within Materialize? For someone who wants to use Materialize, what is involved in getting it set up and integrated with their data sources? What is the workflow for defining and maintaining a set of views? What are some of the complexities that users might face in ensuring the ongoing functionality of those views? For someone who is unfamiliar with the semantics of streaming SQL, what are some of the conceptual shifts that they should be aware of? The Materialize product is currently pre-release. What are the remaining steps before launching it? What do you have planned for the future of the product and company? Contact Info frankmcsherry on GitHub @frankmcsherry on Twitter Blog Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. 
To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Links Materialize Timely Dataflow Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks Naiad: A Timely Dataflow System Differential Privacy PageRank Data Council Presentation on Materialize Change Data Capture Debezium Apache Spark Podcast Episode Flink Podcast Episode Go language Rust Haskell Rust Borrow Checker GDB (GNU Debugger) Avro Apache Calcite ANSI SQL 92 Correlated Subqueries OOM (Out Of Memory) Killer Log-Structured Merge Tree The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Dec 16, 2019 • 1h 2min

Solving Data Lineage Tracking And Data Discovery At WeWork

Summary Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? 
Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web based transformation tool with built in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email team@dataform.co with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata Interview Introduction How did you get involved in the area of data management? Can you start by describing what Marquez is? What was missing in existing metadata management platforms that necessitated the creation of Marquez? How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs? How does it compare to the Amundsen platform that Lyft recently released? 
What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see? What are some of the capabilities that are unique to Marquez and how are you using them at WeWork? What are the primary resource types that you support in Marquez? What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository? Can you explain how Marquez is architected and how the design has evolved since you first began working on it? Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez? What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database? How is the metadata itself stored and managed in Marquez? How much up-front data modeling is necessary and what types of schema representations are supported? Can you talk through the overall workflow of someone using Marquez in their environment? What is involved in registering and updating datasets? How do you define and track the health of a given dataset? What are some of the interesting questions that can be answered from the information stored in Marquez? What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases? For someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it? What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform? When is Marquez the wrong choice for a metadata repository? What do you have planned for the future of Marquez? 
Contact Info Julien Le Dem @J_ on Twitter Email julienledem on GitHub Willy LinkedIn @wslulciuc on Twitter wslulciuc on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Links Marquez DataEngConf Presentation WeWork Canary Yahoo Dremio Hadoop Pig Parquet Podcast Episode Airflow Apache Atlas Amundsen Podcast Episode Uber DataBook LinkedIn DataHub Iceberg Table Format Podcast Episode Delta Lake Podcast Episode Great Expectations data pipeline unit testing framework Podcast.__init__ Episode Redshift SnowflakeDB Podcast Episode Apache Kafka Schema Registry Podcast Episode Open Tracing Jaeger Zipkin DropWizard Java framework Marquez UI Cayley Graph Database Kubernetes Marquez Helm Chart Marquez Docker Container Dagster Podcast Episode Luigi DBT Podcast Episode Thrift Protocol Buffers The intro and outro music is from

Dec 9, 2019 • 59min

SnowflakeDB: The Data Warehouse Built For The Cloud

Summary Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge to take advantage of cloud services that simplify the separation of compute and storage. In this episode Kent Graziano, chief technical evangelist for SnowflakeDB, explains how it is differentiated from other managed platforms and traditional data warehouse engines, the features that allow you to scale your usage dynamically, and how it allows for a shift in your workflow from ETL to ELT. If you are evaluating your options for building or migrating a data platform, then this is definitely worth a listen. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. 
We have partnered with organizations such as O’Reilly Media and the Python Software Foundation. Upcoming events include the Software Architecture Conference in NYC and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse Interview Introduction How did you get involved in the area of data management? Can you start by explaining what SnowflakeDB is for anyone who isn’t familiar with it? How does it compare to the other available platforms for data warehousing? How does it differ from traditional data warehouses? How do the performance and flexibility affect the data modeling requirements? Snowflake is one of the data stores that is enabling the shift from an ETL to an ELT workflow. What are the features that allow for that approach and what are some of the challenges that it introduces? Can you describe how the platform is architected and some of the ways that it has evolved as it has grown in popularity? What are some of the current limitations that you are struggling with? For someone getting started with Snowflake what is involved with loading data into the platform? What is their workflow for allocating and scaling compute capacity and running analyses? One of the interesting features enabled by your architecture is data sharing. What are some of the most interesting or unexpected uses of that capability that you have seen? What are some other features or use cases for Snowflake that are not as well known or publicized which you think users should know about? When is SnowflakeDB the wrong choice? What are some of the plans for the future of SnowflakeDB? 
Contact Info LinkedIn Website @KentGraziano on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links SnowflakeDB Free Trial Stack Overflow Data Warehouse Oracle DB MPP == Massively Parallel Processing Shared Nothing Architecture Multi-Cluster Shared Data Architecture Google BigQuery AWS Redshift AWS Redshift Spectrum Presto Podcast Episode SnowflakeDB Semi-Structured Data Types Hive ACID == Atomicity, Consistency, Isolation, Durability 3rd Normal Form Data Vault Modeling Dimensional Modeling JSON AVRO Parquet SnowflakeDB Virtual Warehouses CRM == Customer Relationship Management Master Data Management Podcast Episode FoundationDB Podcast Episode Apache Spark Podcast Episode SSIS == SQL Server Integration Services Talend Informatica Fivetran Podcast Episode Matillion Apache Kafka Snowpipe Snowflake Data Exchange OLTP == Online Transaction Processing GeoJSON Snowflake Documentation SnowAlert Splunk Data Catalog The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Dec 3, 2019 • 46min

Organizing And Empowering Data Engineers At Citadel

Summary The financial industry has long been driven by data, requiring a mature and robust capacity for discovering and integrating valuable sources of information. Citadel is no exception, and in this episode Michael Watson and Robert Krzyzanowski share their experiences managing and leading the data engineering teams that power the business. They shared helpful insights into some of the challenges associated with working in a regulated industry, organizing teams to deliver value rapidly and reliably, and how they approach career development for data engineers. This was a great conversation for an inside look at how to build and maintain a data driven culture. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Michael Watson and Robert Krzyzanowski about the technical and organizational challenges that they and their teams are working on at Citadel Interview Introduction How did you get involved in the area of data management? Can you start by describing the size and structure of the data engineering teams at Citadel? How have the scope and nature of responsibilities for data engineers evolved over the past few years at Citadel as more and better tools and platforms have been made available in the space and machine learning techniques have grown more sophisticated? Can you describe the types of data that you are working with at Citadel? What is the process for identifying, evaluating, and ingesting new sources of data? What are some of the common core aspects of your data infrastructure? What are some of the ways that it differs across teams or projects? How involved are data engineers in the overall product design and delivery lifecycle? For someone who joins your team as a data engineer, what are some of the options available to them for a career path? What are some of the challenges that you are currently facing in managing the data lifecycle for projects at Citadel? What are some tools or practices that you are excited to try out? Contact Info Michael LinkedIn @detroitcoder on Twitter detroitcoder on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. 
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Links Citadel Python Hedge Fund Quantitative Trading Citadel Securities Apache Airflow Jupyter Hub Alembic database migrations for SQLAlchemy Terraform DQM == Data Quality Management Great Expectations Podcast.__init__ Episode Nomad RStudio Active Directory The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Nov 26, 2019 • 1h 1min

Building A Real Time Event Data Warehouse For Sentry

Summary The team at Sentry has built a platform for anyone in the world to send software errors and events. As they scaled the volume of customers and data they began running into the limitations of their initial architecture. To address the needs of their business and continue to improve their capabilities they settled on Clickhouse as the new storage and query layer to power their business. In this episode James Cunningham and Ted Kaemming describe the process of rearchitecting a production system, what they learned in the process, and some useful tips for anyone else evaluating Clickhouse. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. 
Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Ted Kaemming and James Cunningham about Snuba, the new open source search service at Sentry implemented on top of Clickhouse Interview Introduction How did you get involved in the area of data management? Can you start by describing the internal and user-facing issues that you were facing at Sentry with the existing search capabilities? What did the previous system look like? What were your design criteria for building a new platform? What was your initial list of possible system components and what was your evaluation process that resulted in your selection of Clickhouse? Can you describe the system architecture of Snuba and some of the ways that it differs from your initial ideas of how it would work? What have been some of the sharp edges of Clickhouse that you have had to engineer around? How have you found the operational aspects of Clickhouse? How did you manage the introduction of this new piece of infrastructure to a business that was already handling massive amounts of real-time data? What are some of the downstream benefits of using Clickhouse for managing event data at Sentry? For someone who is interested in using Snuba for their own purposes, how flexible is it for different domain contexts? What are some of the other data challenges that you are currently facing at Sentry? What is your next highest priority for evolving or rebuilding to address technical or business challenges? Contact Info James @JTCunning on Twitter JTCunning on GitHub Ted tkaemming on GitHub Website @tkaemming on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! 
Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Links Sentry Podcast.__init__ Episode Snuba Blog Post Clickhouse Podcast Episode Disqus Urban Airship HBase Google Bigtable PostgreSQL Redis HyperLogLog Riak Celery RabbitMQ Apache Spark Presto Cassandra Apache Kudu Apache Pinot Apache Druid Flask Apache Kafka Cassandra Tombstone Sentry Blog XML Change Data Capture The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
