Data Engineering Podcast

Tobias Macey

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Episodes

Oct 8, 2021 • 44min

Make Your Business Metrics Reusable With Open Source Headless BI Using Metriql

Summary

The key to making data valuable to business users is the ability to calculate meaningful metrics and explore them along useful dimensions. Business intelligence tools have provided this capability for years, but they don’t offer a means of exposing those metrics to other systems. Metriql is an open source project that provides a headless BI system where you can define your metrics and share them with all of your other processes. In this episode Burak Kabakcı shares the story behind the project, how you can use it to create your metrics definitions, and the benefits of treating the semantic layer as a dedicated component of your platform.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

Your host is Tobias Macey and today I’m interviewing Burak Emre Kabakcı about Metriql, a headless BI and metrics layer for your data stack.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Metriql is and the story behind it?
What are the characteristics and benefits of a "headless BI" system?
What was your motivation to create and open-source Metriql as an independent project outside of your business?
How are you approaching governance and sustainability of the project?
How does Metriql compare to projects such as AirBnB’s Minerva or Transform’s platform?
How does the industry/vertical of a business impact their ability to benefit from a metrics layer/headless BI?
What are the limitations to the logical complexity that can be applied to the calculation of a given metric/set of metrics?
Can you describe how Metriql is implemented?
How have the design and goals of the project changed or evolved since you began working on it?
What are the most complex/difficult engineering elements of building a metrics layer?
Can you describe the workflow of defining metrics?
What have been your guiding principles in defining the user experience for working with Metriql?
What are the opportunities for including business users in the definition of metrics? (e.g. pushing down/generating definitions from a BI layer)
What are the biggest challenges and limitations of creating metrics definitions purely in SQL?
What are the options for exposing metrics back to the warehouse and other operational systems such as reverse ETL vendors?
What are the missing elements in the data ecosystem for taking full advantage of a headless BI/metrics layer?
What are the most interesting, innovative, or unexpected ways that you have seen Metriql used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metriql?
When is Metriql the wrong choice?
What do you have planned for the future of Metriql?

Contact Info

LinkedIn
Website
buremba on GitHub
@bu7emba on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Metriql
Rakam
Hazelcast
Headless BI
Google Data Studio
Superset (Podcast Episode, Podcast.__init__ Episode)
Trino (Podcast Episode)
Supergrain
The Missing Piece Of The Modern Data Stack article by Benn Stancil
Metabase (Podcast Episode)
dbt (Podcast Episode)
dbt-metabase
re_data
OpenMetadata

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Oct 6, 2021 • 46min

Adding Support For Distributed Transactions To The Redpanda Streaming Engine

Summary

Transactions are a necessary feature for ensuring that a set of actions are all performed as a single unit of work. In streaming systems this is necessary to ensure that a set of messages or transformations are all executed together across different queues. In this episode Denis Rystsov explains how he added support for transactions to the Redpanda streaming engine. He discusses the use cases for transactions, the different strategies, semantics, and guarantees that they might need to support, and how his implementation ended up improving the performance of bulk write operations. This is an interesting deep dive into the internals of a high performance streaming engine and the details that are involved in building distributed systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 Advanced All-In-One Virtual Reality Headset. RSVP today; you don’t want to miss it!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

Your host is Tobias Macey and today I’m interviewing Denis Rystsov about implementing transactions in the RedPanda streaming engine.

Interview

Introduction
How did you get involved in the area of data management?
Can you quickly recap what RedPanda is and the goals of the project?
What are the use cases for transactions in a pub/sub messaging system?
What are the elements of streaming systems that make atomic transactions a complex problem?
What was the motivation for starting down the path of adding transactions to the RedPanda engine?
How did the constraint of supporting the Kafka API influence your implementation strategy for transaction semantics?
Can you talk through the details of how you ended up implementing transactions in RedPanda?
What are some of the roadblocks and complexities that you encountered while working through the implementation?
How did you approach the validation and verification of the transactions?
What other features or capabilities are you planning to work on next?
What are the most interesting, innovative, or unexpected ways that you have seen transactions in RedPanda used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on transactions for RedPanda?
When are transactions the wrong choice?
What do you have planned for the future of transaction support in RedPanda?

Contact Info

@rystsov on Twitter
LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Vectorized
RedPanda (Podcast Episode)
RedPanda Transactions Post
Yandex
Cassandra
MongoDB
Riak
Cosmos DB
Jepsen (Podcast Episode)
Testing Shared Memories paper
Journal of Systems Research
Kafka
Pulsar
Seastar Framework
CockroachDB (Podcast Episode)
TiDB
Calvin Paper
Polyjuice Paper
Parallel Commit
Chaos Testing
Matchmaker Paxos Algorithm

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Oct 2, 2021 • 1h 8min

Building Real-Time Data Platforms For Large Volumes Of Information With Aerospike

Summary

Aerospike is a database engine that is designed to provide millisecond response times for queries across terabytes or petabytes. In this episode Chief Strategy Officer, Lenley Hensarling, explains how the ability to process these large volumes of information in real-time allows businesses to unlock entirely new capabilities. He also discusses the technical implementation that allows for such extreme performance and how the data model contributes to the scalability of the system. If you need to deal with massive data, at high velocities, in milliseconds, then Aerospike is definitely worth learning about.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold’s proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

Your host is Tobias Macey and today I’m interviewing Lenley Hensarling about Aerospike and building real-time data platforms.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Aerospike is and the story behind it?
What are the use cases that it is uniquely well suited for?
What are the use cases that you and the Aerospike team are focusing on and how does that influence your focus on priorities of feature development and user experience?
What are the driving factors for building a real-time data platform?
How is Aerospike being incorporated in application and data architectures?
Can you describe how the Aerospike engine is architected?
How have the design and architecture changed or evolved since it was first created?
How have market forces influenced the product priorities and focus?
What are the challenges that end users face when determining how to model their data given a key/value storage interface?
What are the abstraction layers that you and/or your users build to manage relational or hierarchical data architectures?
What are the operational characteristics of the Aerospike system? (e.g. deployment, scaling, CP vs AP, upgrades, clustering, etc.)
What are the most interesting, innovative, or unexpected ways that you have seen Aerospike used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aerospike?
When is Aerospike the wrong choice?
What do you have planned for the future of Aerospike?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Aerospike
GitHub
EnterpriseDB
"Nobody Expects The Spanish Inquisition"
ARM CPU Architectures
AWS Graviton Processors
The Datacenter Is The Computer (Affiliate link)
Jepsen Tests (Podcast Episode)
Cloud Native Computing Foundation
Prometheus
Grafana
OpenTelemetry (Podcast.__init__ Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Sep 30, 2021 • 1h 12min

Episodes

Make Your Business Metrics Reusable With Open Source Headless BI Using Metriql

Adding Support For Distributed Transactions To The Redpanda Streaming Engine

Building Real-Time Data Platforms For Large Volumes Of Information With Aerospike

Delivering Your Personal Data Cloud With Prifina

Digging Into Data Reliability Engineering

Massively Parallel Data Processing In Python Without The Effort Using Bodo

Declarative Machine Learning Without The Operational Overhead Using Continual

An Exploration Of The Data Engineering Requirements For Bioinformatics

Setting The Stage For The Next Chapter Of The Cassandra Database

A View From The Round Table Of Gartner's Cool Vendors
