Data Engineering Podcast

Tobias Macey
Dec 21, 2021 • 55min

Fast And Flexible Headless Data Analytics With Cube.JS

Summary One of the perennial challenges of data analytics is having a consistent set of definitions, along with a flexible and performant API endpoint for querying them. In this episode Artom Keydunov and Pavel Tiunov share their work on Cube.js and the various ways that it is being used in the open source community. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Your host is Tobias Macey and today I’m interviewing Artyom Keydunov and Pavel Tiunov about Cube.js a framework for building analytics APIs to power your applications and BI dashboards Interview Introduction How did you get involved in the area of data management? Can you describe what Cube is and the story behind it? What are the main use cases and platform architectures that you are focused on? Who are the target personas that will be using and managing Cube.js? The name comes from the concept of an OLAP cube. Can you discuss the applications of OLAP cubes and their role in the current state of the data ecosystem? How does the idea of an OLAP cube compare to the recent focus on a dedicated metrics layer? What are the pieces of a data platform that might be replaced by Cube.js? Can you describe the design and architecture of the Cube platform? How has the focus and target use case for the Cube platform evolved since you first started working on it? 
One of the perpetually hard problems in computer science is cache management. How have you approached that challenge in the pre-aggregation layer of the Cube framework? What is your overarching design philosophy for the API of the Cube system? Can you talk through the workflow of someone building a cube and querying it from a downstream system? What do the iteration cycles look like as you go from initial proof of concept to a more sophisticated usage of Cube.js? What are some of the data modeling steps that are needed in the source systems? The perennial problem of embedding SQL into another host language or DSL is how to deal with validation and developer tooling. What are the utilities that you and the community have built to reduce friction while writing the definitions of a cube? What are the methods available for maintaining visibility across all of the cubes defined within and across installations of Cube.js? What are the opportunities for composing multiple cubes together to form a higher level aggregation? What are the most interesting, innovative, or unexpected ways that you have seen Cube.js used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube? When is Cube the wrong choice? What do you have planned for the future of Cube? Contact Info Artom keydunov on GitHub @keydunov on Twitter LinkedIn Pavel LinkedIn @paveltiunov87 on Twitter paveltiunov on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Links Cube.js Statsbot chart.js Highcharts D3 OLAP Cube dbt Superset Podcast Episode Streamlit Podcast.__init__ Episode Parquet Hasura kSQLDB Podcast Episode Materialize Podcast Episode Meltano Podcast Episode The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
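The episode above centers on defining metrics once in Cube.js and then querying them through a consistent API from downstream dashboards and applications. As a rough illustration of that pattern, here is a minimal Python sketch of issuing a query against Cube's REST load endpoint; the deployment URL, auth token, and the `Orders` cube with its measure and dimensions are hypothetical placeholders, not details taken from the episode.

```python
# Minimal sketch of querying a Cube.js deployment's REST API from Python.
# The URL, token, cube name, and field names below are illustrative assumptions.
import json
import requests

CUBE_API_URL = "https://analytics.example.com/cubejs-api/v1/load"  # hypothetical deployment
API_TOKEN = "YOUR_CUBE_API_TOKEN"  # Cube deployments typically authenticate with a JWT

query = {
    "measures": ["Orders.count"],       # assumed measure defined in an Orders cube
    "dimensions": ["Orders.status"],    # assumed dimension
    "timeDimensions": [
        {
            "dimension": "Orders.createdAt",
            "granularity": "month",
            "dateRange": "last 6 months",
        }
    ],
}

response = requests.get(
    CUBE_API_URL,
    params={"query": json.dumps(query)},
    headers={"Authorization": API_TOKEN},
    timeout=30,
)
response.raise_for_status()

for row in response.json().get("data", []):
    print(row)
```

The same JSON query shape is what a BI tool or embedded analytics component would send, which is the "headless" part of the design discussed in the interview.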
Dec 20, 2021 • 1h 6min

Building A System Of Record For Your Organization's Data Ecosystem At Metaphor

Summary Building a well managed data ecosystem for your organization requires a holistic view of all of the producers, consumers, and processors of information. The team at Metaphor are building a fully connected metadata layer to provide both technical and social intelligence about your data. In this episode Pardhu Gunnam and Mars Lan explain how they have designed the architecture and user experience to allow everyone to collaborate on the data lifecycle and provide opportunities for automation and extensible workflows. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey and today I’m interviewing Pardhu Gunnam and Mars Lan about Metaphor Data, a platform aiming to be the system of record for your data ecosystem Interview Introduction How did you get involved in the area of data management? Can you describe what Metaphor is and the story behind it? On your site it states that you are aiming to be the "system of record" for your data platform. Can you unpack that statement and its implications? What are the shortcomings in the "data catalog" approach to metadata collection and presentation? Who are the target end users of Metaphor and what are the pain points for each persona that you are prioritizing? How has that focus informed your priorities for user experience design and feature development? 
Can you describe how the Metaphor platform is architected? What are the lessons that you learned from your work at DataHub that have informed your work on Metaphor? There has been a huge amount of focus on the "modern data stack" with an assumption that there is a cloud data warehouse as the central component that all data flows through. How does Metaphor’s design allow for usage in platforms that aren’t dominated by a cloud data warehouse? What are some examples of information that you can extract through integrations with an organization’s communication platforms? Can you talk through a few example workflows where that information is used to inform the actions taken by a team member? What is your philosophy around data modeling or schema standardization for metadata records? What are some of the challenges that teams face in stitching together a meaningful set of relations across metadata records in Metaphor? What are some of the features or potential use cases for Metaphor that are overlooked or misunderstood as you work with your customers? What are the most interesting, innovative, or unexpected ways that you have seen Metaphor used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metaphor? When is Metaphor the wrong choice? What do you have planned for the future of Metaphor? Contact Info Pardhu LinkedIn @PardhuGunnam on Twitter Mars LinkedIn mars-lan on GitHub @mars_lan on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Links Metaphor The Modern Metadata Platform Why cant I find the right data? DataHub Transform Podcast Episode Supergrain MetriQL Podcast Episode dbt Podcast Interview OpenMetadata Podcast Interview Pegasus Data Language Modern Data Experience The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
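The interview touches on data modeling and schema standardization for metadata records, and on stitching relationships between them. Purely as a generic illustration of what such records might capture — not Metaphor's actual schema, which the show notes say is modeled with the Pegasus Data Language — here is a small sketch:

```python
# Generic illustration of a dataset metadata record and a lineage relationship.
# This is NOT Metaphor's schema; the class names and fields are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetEntity:
    platform: str                     # e.g. "snowflake"
    name: str                         # e.g. "analytics.orders_daily"
    owners: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


@dataclass
class LineageEdge:
    upstream: DatasetEntity
    downstream: DatasetEntity
    transformation: Optional[str] = None  # e.g. the dbt model or job connecting them


raw_orders = DatasetEntity("snowflake", "raw.orders", owners=["data-eng@example.com"])
orders_daily = DatasetEntity("snowflake", "analytics.orders_daily", tags=["certified"])
edge = LineageEdge(raw_orders, orders_daily, transformation="dbt:orders_daily")
print(edge)
```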
Dec 13, 2021 • 42min

Building Auditable Spark Pipelines At Capital One

Summary Spark is a powerful and battle-tested framework for building highly scalable data pipelines. Because of its proven ability to handle large volumes of data, Capital One has invested in it for their business needs. In this episode Gokul Prabagaren shares how he uses it to calculate customer rewards points, including the auditing requirements and how he designed his pipeline to maintain all of the necessary information through a pattern of data enrichment. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Your host is Tobias Macey and today I’m interviewing Gokul Prabagaren about how he is using Spark for real-world workflows at Capital One Interview Introduction How did you get involved in the area of data management? Can you start by giving an overview of the types of data and workflows that you are responsible for at Capital One? In terms of the three "V"s (Volume, Variety, Velocity), what is the magnitude of the data that you are working with? What are some of the business and regulatory requirements that have to be factored into the solutions that you design? Who are the consumers of the data assets that you are producing? Can you describe the technical elements of the platform that you use for managing your data pipelines? What are the various ways that you are using Spark at Capital One?
You wrote a post and presented at the Databricks conference about your experience moving from a data filtering to a data enrichment pattern for segmenting transactions. Can you give some context as to the use case and what your design process was for the initial implementation? What were the shortcomings to that approach/business requirements which led you to refactoring the approach to one that maintained all of the data through the different processing stages? What are some of the impacts on data volumes and processing latencies working with enriched data frames persisted between task steps? What are some of the other optimizations or improvements that you have made to that pipeline since you wrote the post? What are some of the limitations of Spark that you have experienced during your work at Capital One? How have you worked around them? What are the most interesting, innovative, or unexpected ways that you have seen Spark used at Capital One? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data engineering at Capital One? What are some of the upcoming projects that you are focused on/excited for? How has your experience with the filtering vs. enrichment approach influenced your thinking on other projects that you work on? Contact Info @gocool_p on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Links Apache Spark Blog Post Databricks Presentation Delta Lake Podcast Episode The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
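The conversation centers on moving from a filtering pattern, where ineligible records are dropped between stages, to an enrichment pattern, where every record is kept and annotated so the full history stays auditable. A rough PySpark sketch of that contrast follows; the input path, column names, and eligibility rule are made-up examples rather than Capital One's actual logic.

```python
# Sketch of filtering vs. enrichment in PySpark. The path, columns, and
# eligibility rule are illustrative assumptions, not the actual pipeline.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rewards-enrichment-sketch").getOrCreate()

txns = spark.read.parquet("s3://example-bucket/transactions/")  # assumed input

# Filtering pattern: ineligible rows are dropped, so later stages (and auditors)
# can no longer see why a transaction was excluded.
eligible_only = txns.filter(F.col("category") == "eligible")

# Enrichment pattern: keep every row and record the decision alongside it,
# so the full population survives each processing stage.
enriched = (
    txns.withColumn("is_eligible", F.col("category") == "eligible")
    .withColumn(
        "exclusion_reason",
        F.when(F.col("category") != "eligible", F.lit("category_not_eligible")),
    )
)

# Downstream aggregation still only counts eligible rows, but the excluded ones
# remain available in the enriched dataset for auditing.
rewards = (
    enriched.filter(F.col("is_eligible"))
    .groupBy("account_id")
    .agg(F.sum("amount").alias("rewardable_spend"))
)
```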
Dec 12, 2021 • 58min

Deliver Personal Experiences In Your Applications With The Unomi Open Source Customer Data Platform

Summary The core of providing your users with excellent service is understanding them and providing a personalized experience. Unfortunately, many sites and applications take that to the extreme and collect too much information. In order to make it easier for developers to build customer profiles in a way that respects their privacy, Serge Huber helped to create the Apache Unomi framework as an open source customer data platform. In this episode he explains how it can be used to build rich and useful profiles of your users, the system architecture that powers it, and some of the ways that it is being integrated into an organization’s broader data ecosystem. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey and today I’m interviewing Serge Huber about Apache Unomi, an open source customer data platform designed to manage customer, lead, and visitor data and help personalize customer experiences Interview Introduction How did you get involved in the area of data management? Can you describe what Unomi is and the story behind it? What are the goals and target use cases of Unomi? What are the aspects of collecting and aggregating profile information that present challenges to developers? How does the design of Unomi reduce that burden?
How does the focus of Unomi compare to systems such as Segment/Rudderstack or Optimizely for collecting user interactions and applying personalization? How does Unomi fit in the architecture of an application or data infrastructure? Can you describe how Unomi itself is architected? How have the goals and design of the project changed or evolved since it started? What are some of the most complex or challenging engineering projects that you have worked through? Can you describe the workflow of using Unomi to manage a set of customer profiles? What are some examples of user experience customization that you can build with Unomi? What are some alternative architectures that you have seen to produce similar capabilities? One of the interesting features of Unomi is the end-user profile management. What are some of the system and developer challenges that are introduced by that capability? (e.g. constraints on data manipulation, security, privacy concerns, etc.) How did Unomi manage privacy concerns and the GDPR ? How does Unomi help with the new third party data restrictions ? Why is access to raw data so important ? Could cloud providers offer Unomi as a service ? How have you used Unomi in your own work? What are the most interesting, innovative, or unexpected ways that you have seen Unomi used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Unomi? When is Unomi the wrong choice? What do you have planned for the future of Unomi? Contact Info LinkedIn @sergehuber on Twitter @bhillou on Twitter sergehuber on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Apache Unomi Jahia OASIS Open Foundation Segment Podcast Episode Rudderstack The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Dec 4, 2021 • 50min

Data Driven Hiring For Data Professionals With Alooba

Summary Hiring data professionals is challenging for a multitude of reasons, and as with every interview process there is a potential for bias to creep in. Tim Freestone founded Alooba to provide a more stable reference point for evaluating candidates to ensure that you can make more informed comparisons based on their actual knowledge. In this episode he explains how Alooba got started, how it is being used in the interview process for data oriented roles, and how it can also provide visibility into your organizations overall data literacy. The whole process of hiring is an important organizational skill to cultivate and this is an interesting exploration of the specific challenges involved in finding data professionals. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey and today I’m interviewing Tim Freestone about Alooba, an assessment platform for evaluating data and analytics candidates to improve hiring outcomes for data roles. Interview Introduction How did you get involved in the area of data management? Can you describe what Alooba is and the story behind it? What are the main goals that you are trying to achieve with Alooba? What are the main challenges that employers and candidates face when navigating their respective roles in the hiring process? 
What are some of the difficulties that are specific to data oriented roles? What are some of the complexities involved in designing a user experience that is positive and productive for both candidates and companies? What are some strategies that you have developed for establishing a fair and consistent baseline of skills to ensure consistent comparison across candidates? One of the problems that comes from test-based skills assessment is the implicit bias toward candidates who test well. How do you work to mitigate that in the candidate evaluation process? Can you describe how the Alooba platform itself is implemented? How have the goals and design of the system changed or evolved since you first started it? What are some of the ways that you use Alooba internally? How do you stay up to date with the evolving skill requirements as roles change and new roles are created? Beyond evaluation of candidates for hiring, what are some of the other features that you have added to Alooba to support organizations in their effort to gain value from their data? What are the most interesting, innovative, or unexpected ways that you have seen Alooba used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Alooba? When is Alooba the wrong choice? What do you have planned for the future of Alooba? Contact Info LinkedIn @timmyfreestone on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Alooba The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Dec 4, 2021 • 58min

Experimentation and A/B Testing For Modern Data Teams With Eppo

Summary A/B testing and experimentation are the most reliable way to determine whether a change to your product will have the desired effect on your business. Unfortunately, being able to design, deploy, and validate experiments is a complex process that requires a mix of technical capacity and organizational involvement which is hard to come by. Chetan Sharma founded Eppo to provide a system that organizations of every scale can use to reduce the burden of managing experiments so that you can focus on improving your business. In this episode he digs into the technical, statistical, and design requirements for running effective experiments and how he has architected the Eppo platform to make the process more accessible to business and data professionals. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Your host is Tobias Macey and today I’m interviewing Chetan Sharma about Eppo, a platform for building A/B experiments that are easier to manage Interview Introduction How did you get involved in the area of data management? Can you describe what Eppo is and the story behind it? What are some examples of the kinds of experiments that teams and organizations might want to conduct? What are the points of friction that teams encounter along the way? What are the steps involved in designing, deploying, and analyzing the outcomes of an A/B experiment?
What are some of the statistical errors that are common when conducting an experiment? What are the design and UX principles that you have focused on in Eppo to improve the workflow of building and analyzing experiments? Can you describe the system design of the Eppo platform? What are the services or capabilities external to Eppo that are required for it to be effective? What are the integration points for adding Eppo to an organization’s existing platform? Beyond the technical capabilities for running experiments there are a number of design requirements involved. Can you talk through some of the decisions that need to be made when deciding what to change and how to measure its impact? Another difficult element of managing experiments is understanding how they all interact with each other when running a large number of simultaneous tests. How does Eppo help with tracking the various experiments and the cohorts that are bucketed into each? What are some of the ideas or assumptions that you had about the technical and design aspects of running experiments that have been challenged or changed while building Eppo? What are the most interesting, innovative, or unexpected ways that you have seen Eppo used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Eppo? When is Eppo the wrong choice? What do you have planned for the future of Eppo? Contact Info LinkedIn @chesharma87 on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Eppo Knowledge Repo Apache Hive Frequentist Statistics Rudderstack The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
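One thread in this conversation is the statistical errors that creep into experiment analysis. As a concrete, entirely made-up example of the kind of frequentist calculation an experimentation platform automates, here is a two-sided two-proportion z-test on hypothetical conversion counts:

```python
# Hedged sketch: a frequentist two-proportion z-test of the kind an experimentation
# platform runs. The conversion counts are made-up numbers for illustration only.
from math import sqrt
from statistics import NormalDist

control_conversions, control_n = 480, 10_000
treatment_conversions, treatment_n = 540, 10_000

p_control = control_conversions / control_n
p_treatment = treatment_conversions / treatment_n
p_pooled = (control_conversions + treatment_conversions) / (control_n + treatment_n)

# Standard error under the null hypothesis that both variants convert at the pooled rate.
se = sqrt(p_pooled * (1 - p_pooled) * (1 / control_n + 1 / treatment_n))
z = (p_treatment - p_control) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test

print(f"lift: {p_treatment - p_control:.4f}, z = {z:.2f}, p = {p_value:.4f}")
```

Running this by hand for every metric and every experiment is where mistakes like peeking, multiple comparisons, and underpowered tests tend to enter, which is part of the motivation for a dedicated platform.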
Nov 27, 2021 • 59min

Creating A Unified Experience For The Modern Data Stack At Mozart Data

Summary The modern data stack has been gaining a lot of attention recently with a rapidly growing set of managed services for different stages of the data lifecycle. With all of the available options it is possible to run a scalable, production-grade data platform with a small team, but there are still sharp edges and integration challenges to work through. Peter Fishman and Dan Silberman experienced these difficulties firsthand and created Mozart Data to provide a single, easy-to-use option for getting started with the modern data stack. In this episode they explain how they designed a user experience to make working with data more accessible to organizations without a data team, while allowing more advanced users to build out more complex workflows. They also share their thoughts on the modern data ecosystem and how it improves the availability of analytics for companies of all sizes. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Your host is Tobias Macey and today I’m interviewing Peter Fishman and Dan Silberman about Mozart Data and how they are building a unified experience for the modern data stack Interview Introduction How did you get involved in the area of data management? Can you describe what Mozart Data is and the story behind it? The promise of the "modern data stack" is that it’s all delivered as a service to make it easier to set up.
What are the missing pieces that make something like Mozart necessary? What are the main workflows or industries that you are focusing on? Who are the main personas that you are building Mozart for? How has that combination of user persona and industry focus informed your decisions around feature priorities and user experience? Can you describe how you have architected the Mozart platform? How have you approached the build vs. buy decision internally? What are some of the most interesting or challenging engineering projects that you have had to work on while building Mozart? What are the stages of the data lifecycle that you work the hardest to automate, and which do you focus on exposing to customers? What are the edge cases in what customers might try to do in the bounds of Mozart, or areas where you have explicitly decided not to include in your features? What are the options for extensibility, or custom engineering when customers encounter those situations? What do you see as the next phase in the evolution of the data stack? What are the most interesting, innovative, or unexpected ways that you have seen Mozart used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Mozart? When is Mozart the wrong choice? What do you have planned for the future of Mozart? Contact Info Peter LinkedIn @peterfishman on Twitter Dan LinkedIn silberman on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Links Mozart Data Modern Data Stack Mode Analytics Fivetran Podcast Episode Snowflake Podcast Episode The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Nov 27, 2021 • 59min

Doing DataOps For External Data Sources As A Service at Demyst

Summary The data that you have access to affects the questions that you can answer. By using external data sources you can drastically increase the range of analysis that is available to your organization. The challenge comes in all of the operational aspects of finding, accessing, organizing, and serving that data. In this episode Mark Hookey discusses how he and his team at Demyst do all of the DataOps for external data sources so that you don’t have to, including the systems necessary to organize and catalog the various collections that they host, the various serving layers to provide query interfaces that match your platform, and the utility of having a single place to access a multitude of information. If you are having trouble answering questions for your business with the data that you generate and collect internally, then it is definitely worthwhile to explore the information available from external sources. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey and today I’m interviewing Mark Hookey about Demyst Data, a platform for operationalizing external data Interview Introduction How did you get involved in the area of data management? Can you describe what Demyst is and the story behind it? 
What are the services and systems that you provide for organizations to incorporate external sources in their data workflows? Who are your target customers? What are some examples of data sets that an organization might want to use in their analytics? How are these different from SaaS data that an organization might integrate with tools such as Stitcher and Fivetran? What are some of the challenges that are introduced by working with these external data sets? If an organization isn’t using Demyst what are some of the technical and organizational systems that they will need to build and manage? Can you describe how the Demyst platform is architected? What have been the most complex or difficult engineering challenges that you have dealt with while building Demyst? Given the wide variance in the systems that your customers are running, what are some strategies that you have used to provide flexible APIs for accessing the underlying information? What is the process for you to identify and onboard a new data source in your platform? What are some of the additional analytical systems that you have to run to manage your business (e.g. usage metering and analytics, etc.)? What are the most interesting, innovative, or unexpected ways that you have seen Demyst used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Demyst? When is Demyst the wrong choice? What do you have planned for the future of Demyst? Contact Info LinkedIn Email Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Links Demyst Data LexisNexis AWS Athena DataRobot The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Nov 20, 2021 • 53min

Exploring Processing Patterns For Streaming Data Integration In Your Data Lake

Summary One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. With the improvements in streaming engines it is now possible to perform all of your data integration in near real time, but it can be challenging to understand the proper processing patterns to make that performant. In this episode Ori Rafael shares his experiences from Upsolver and building scalable stream processing for integrating and analyzing data, and what the tradeoffs are when coming from a batch oriented mindset. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Your host is Tobias Macey and today I’m interviewing Ori Rafael about strategies for building stream and batch processing patterns for data lake analytics Interview Introduction How did you get involved in the area of data management? Can you start by giving an overview of the state of the market for data lakes today? What are the prevailing architectural and technological patterns that are being used to manage these systems? Batch and streaming systems have been used in various combinations since the early days of Hadoop. The Lambda architecture has largely been abandoned, so what is the answer for today’s data lakes? What are the challenges presented by streaming approaches to data transformations? The batch model for processing is intuitive despite its latency problems. 
What are the benefits that it provides? The core concept for data orchestration is the DAG. How does that manifest in a streaming context? In batch processing idempotent/immutable datasets are created by re-running the entire pipeline when logic changes need to be made. Given that there is no definitive start or end of a stream, what are the options for amending logical errors in transformations? What are some of the data processing/integration patterns that are impossible in a batch system? What are some useful strategies for migrating from a purely batch, or hybrid batch and streaming architecture, to a purely streaming system? What are some of the changes in technological or organizational patterns that are often overlooked or misunderstood in this shift? What are some of the most surprising things that you have learned about streaming systems in your time at Upsolver? What are the most interesting, innovative, or unexpected ways that you have seen streaming architectures used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on streaming data integration? When are streaming architectures the wrong approach? What do you have planned for the future of Upsolver to make streaming data easier to work with? Contact Info LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Links Upsolver Hive Metastore Hudi Podcast Episode Iceberg Podcast Episode Hadoop Lambda Architecture Kappa Architecture Apache Beam Event Sourcing Flink Podcast Episode Spark Structured Streaming The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
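The episode contrasts batch DAGs with continuously running streaming transformations that keep lake tables fresh. As one generic illustration using Spark Structured Streaming, which appears in the links below, the sketch reads from a hypothetical Kafka topic, applies a watermark for late data, and appends windowed aggregates to object storage; the broker, topic, and paths are assumptions, and this is not a description of Upsolver's own engine.

```python
# Hedged sketch of a streaming aggregation with Spark Structured Streaming.
# Requires the spark-sql-kafka connector package; broker, topic, and paths are
# illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-lake-sketch").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker.example.com:9092")  # assumed broker
    .option("subscribe", "clickstream")                            # assumed topic
    .load()
    .selectExpr("CAST(value AS STRING) AS raw", "timestamp")
)

# Windowed count per minute; late-arriving data is handled with a watermark
# instead of re-running an entire batch pipeline.
counts = (
    events.withWatermark("timestamp", "10 minutes")
    .groupBy(F.window("timestamp", "1 minute"))
    .count()
)

query = (
    counts.writeStream.outputMode("append")
    .format("parquet")
    .option("path", "s3://example-bucket/lake/clickstream_counts/")              # assumed lake path
    .option("checkpointLocation", "s3://example-bucket/checkpoints/clickstream/") # assumed checkpoint
    .start()
)
query.awaitTermination()
```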
Nov 20, 2021 • 1h 5min

Laying The Foundation Of Your Data Platform For The Era Of Big Complexity With Dagster

Summary The technology for scaling storage and processing of data has gone through massive evolution over the past decade, leaving us with the ability to work with massive datasets at the cost of massive complexity. Nick Schrock created the Dagster framework to help tame that complexity and scale the organizational capacity for working with data. In this episode he shares the journey that he and his team at Elementl have taken to understand the state of the ecosystem and how they can provide a foundational layer for a holistic data platform. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform and blazing fast NVMe storage there’s nothing slowing you down. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey and today I’m interviewing Nick Schrock about the evolution of Dagster and its path forward Interview Introduction How did you get involved in the area of data management? Can you describe what Dagster is and the story behind it? How has the project and community changed/evolved since we last spoke 2 years ago? How has the experience of the past 2 years clarified the challenges and opportunities that exist in the data ecosystem? What do you see as the foundational vs transient complexities that are germane to the industry? One of the emerging ideas in Dagster is the "software defined data asset" as the central entity in the framework. How has that shifted the way that engineers approach pipeline design and composition? 
How did that conceptual shift inform the accompanying refactor of the core principles in the framework? (jobs, ops, graphs) One of the powerful elements of the Dagster framework is the investment in rich metadata as a foundational principle. What are the opportunities for integrating and extending that context throughout the rest of an organizations data platform? What do you see as the potential for efforts such as OpenLineage and OpenMetadata to allow for other components in the data platform to create and propagate that context more freely? What are some of the project architecture/repository structure/pipeline composition patterns that have begun to form in the community and your own internal work with Dagster? What are some of the anti-patterns that you have seen users fall into when working with Dagster? Along with your recent refactoring of the core API you have also started to roll out the Dagster Cloud offering. What was your process for determining the path to commercialization for the Dagster project and community? How are you managing governance and long-term viability of the open source elements of Dagster? What are your design principles for deciding the boundaries between OSS and commercial features? What do you see as the role of Dagster in the creation of a data platform architecture? What are the opportunities that it creates for data platform engineers? What is your perspective on the tradeoffs of pipelines as software vs. pipelines as "code" vs. low/no-code pipelines? What (if any) option do you see for language agnostic/multi-language pipeline definitions in Dagster? What do you see as the biggest threats to the future success of Dagster/Elementl? You were a relative outsider to the data ecosystem when you first started Dagster/Elementl. What have been the most interesting and surprising experiences as you have invested your time and energy in contributing to the community? What are the most interesting, innovative, or unexpected ways that you have seen Dagster used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dagster? When is Dagster the wrong choice? What do you have planned for the future of Dagster? Contact Info LinkedIn @schrockn on Twitter schrockn on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Links Elementl Series A Announcement Video on software-defined assets Dagster Podcast Episode GraphQL dbt Podcast Episode Open Source Data Stack Conference Meltano Podcast Episode Amundsen Podcast Episode DataHub Podcast Episode Hashicorp Vercel The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
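Since much of this discussion revolves around the software-defined data asset as Dagster's central abstraction, here is a minimal sketch of two assets where the dependency is expressed purely by naming the upstream asset as a parameter; the asset names and toy logic are illustrative rather than taken from the episode.

```python
# Minimal sketch of Dagster software-defined assets; names and logic are
# illustrative assumptions rather than an example from the episode.
import pandas as pd
from dagster import asset, materialize


@asset
def raw_orders() -> pd.DataFrame:
    # Stand-in for an extract step (e.g. pulling from a warehouse or an API).
    return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.0, 40.0]})


@asset
def order_summary(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # The dependency on raw_orders is declared just by naming it as a parameter,
    # which is what lets Dagster build the asset graph and its lineage metadata.
    return pd.DataFrame({"total_amount": [raw_orders["amount"].sum()]})


if __name__ == "__main__":
    # Materialize both assets in-process; in production this would run under
    # the Dagster daemon/UI with schedules or sensors.
    result = materialize([raw_orders, order_summary])
    print(result.success)
```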
