Data Engineering Podcast cover image

Data Engineering Podcast

Latest episodes

undefined
Jan 2, 2022 • 1h 1min

Creating Shared Context For Your Data Warehouse With A Controlled Vocabulary

Summary Communication and shared context are the hardest part of any data system. In recent years the focus has been on data catalogs as the means for documenting data assets, but those introduce a secondary system of record in order to find the necessary information. In this episode Emily Riederer shares her work to create a controlled vocabulary for managing the semantic elements of the data managed by her team and encoding it in the schema definitions in her data warehouse. She also explains how she created the dbtplyr package to simplify the work of creating and enforcing your own controlled vocabularies. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Your host is Tobias Macey and today I’m interviewing Emily Riederer about defining and enforcing column contracts and controlled vocabularies for your data warehouse Interview Introduction How did you get involved in the area of data management? Can you start by discussing some of the anti-patterns that you have encountered in data warehouse naming conventions and how it relates to the modeling approach? (e.g. star/snowflake schema, data vault, etc.) What are some of the types of contracts that can, and should, be defined and enforced in data workflows? What are the boundaries where we should think about establishing those contracts? What is the utility of column and table names for defining and enforcing contracts in analytical work? What is the process for establishing contractual elements in a naming schema? Who should be involved in that design process? Who are the participants in the communication paths for column naming contracts? What are some examples of context and details that can’t be captured in column names? What are some options for managing that additional information and linking it to the naming contracts? Can you describe the work that you have done with dbtplyr to make name contracts a supported construct in dbt projects? How does dbtplyr help in the creation and enforcement of contracts in the development of dbt workflows How are you using dbtplyr in your own work? How do you handle the work of building transformations to make data comply with contracts? What are the supplemental systems/techniques/documentation to work with name contracts and how they are leveraged by downstream consumers? What are the most interesting, innovative, or unexpected ways that you have seen naming contracts and/or dbtplyr used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on dbtplyr? When is dbtplyr the wrong choice? What do you have planned for the future of dbtplyr? Contact Info Twitter Website Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Links dbtplyr Great Expectations Podcast Episode Controlled Vocabularies Presentation dplyr Data Vault Podcast Episode OpenMetadata Podcast Episode The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
undefined
4 snips
Jan 2, 2022 • 1h 3min

A Reflection On The Data Ecosystem For The Year 2021

Summary This has been an active year for the data ecosystem, with a number of new product categories and substantial growth in existing areas. In an attempt to capture the zeitgeist Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy join the show to reflect on the past year and share their thought son the year to come. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey and today I’m interviewing Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy about the key themes of 2021 in the data ecosystem and what to expect for next year Interview Introduction How did you get involved in the area of data management? What were the main themes that you saw data practitioners and vendors focused on this year? What is the major bottleneck for Data teams in 2021? Will it be the same in 2022? One of the ways to reason about progress in any domain is to look at what was the primary bottleneck of further progress (data adoption for decision making) at different points in time. In the data domain, we have seen a number of bottlenecks, for example, scaling data platforms, the answer to which was Hadoop and on-prem columnar stores and then cloud data warehouses such as Snowflake & BigQuery. Then the problem was data integration and transformation which was solved by data integration vendors and frameworks such as Fivetran / Airbyte, modern orchestration frameworks such as Dagster & dbt and “reverse-ETL” Hightouch. What is the main challenge now? Will SQL be challenged as a primary interface to analytical data? In 2020 we’ve seen a few launches of post-SQL languages such as Malloy, Preql, metric layer query languages from Transform and Supergrain. To what extent does speed matter? Over the past couple of months, we’ve seen the resurgence of “benchmark wars” between major data warehousing platforms. To what extent do speed benchmarks inform decisions for modern data teams? How important is query speed in a modern data workflow? What needs to be true about your current DWH solution and potential alternatives to make a move? How has the way data teams work been changing? In 2020 remote seemed like a temporary emergency state. In 2021, it went mainstream. How has that affected the day-to-day of data teams, how they collaborate internally and with stakeholders? What’s it like to be a data vendor in 2021? Vertically integrated vs. modular data stack? There are multiple forces in play. Will the stack continue to be fragmented? Will we see major consolidation? If so, in which parts of the stack? Contact Info Maura LinkedIn Website @outoftheverse on Twitter David LinkedIn @davidjwallace on Twitter dwallace0723 on GitHub Benn LinkedIn @bennstancil on Twitter Gleb LinkedIn @glebmm on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Links Patreon Dutchie Mode Analytics Datafold Podcast Episode Locally Optimistic RJ Metrics Stitch Mozart Data Podcast Episode Dagster Podcast Episode The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
undefined
13 snips
Dec 27, 2021 • 1h 11min

Revisiting The Technical And Social Benefits Of The Data Mesh

Summary The data mesh is a thesis that was presented to address the technical and organizational challenges that businesses face in managing their analytical workflows at scale. Zhamak Dehghani introduced the concepts behind this architectural patterns in 2019, and since then it has been gaining popularity with many companies adopting some version of it in their systems. In this episode Zhamak re-joins the show to discuss the real world benefits that have been seen, the lessons that she has learned while working with her clients and the community, and her vision for the future of the data mesh. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Your host is Tobias Macey and today I’m welcoming back Zhamak Dehghani to talk about her work on the data mesh book and the lessons learned over the past 2 years Interview Introduction How did you get involved in the area of data management? Can you start by giving a brief recap of the principles of the data mesh and the story behind it? How has your view of the principles of the data mesh changed since our conversation in July of 2019? What are some of the ways that your work on the data mesh book influenced your thinking on the practical elements of implementing a data mesh? What do you view as the as-yet-unknown elements of the technical and social design constructs that are needed for a sustainable data mesh implementation? In the opening of your book you state that "Data Mesh is a new approach in sourcing, managing, and accessing data for analytical use cases at scale". As with everything, scale is subjective, but what are some of the heuristics that you rely on for determining when a data mesh is an appropriate solution? What are some of the ways that data mesh concepts manifest at the boundaries of organizations? While the idea of federated access to data product quanta reduces the amount of coordination necessary at the organizational level, it raises the spectre of more complex logic required for consumers of multiple quanta. How can data mesh implementations mitigate the impact of this problem? What are some of the technical components that you have found to be best suited to the implementation of data elements within a mesh? What are the technological components that are still missing for a mesh-native data platform? How should an organization that wishes to implement a mesh style architecture think about the roles and skills that they will need on staff? How can vendors factor into the solution? What is the role of application developers in a data mesh ecosystem and how do they need to change their thinking around the interfaces that they provide in their products? What are the most interesting, innovative, or unexpected ways that you have seen data mesh principles used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh implementations? When is a data mesh the wrong approach? What do you think the future of the data mesh will look like? Contact Info LinkedIn @zhamakd on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Data Engineering Podcast Data Mesh Interview Data Mesh Book Thoughtworks Expert Systems OpenLineage Podcast Episode Data Mesh Learning The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
undefined
Dec 27, 2021 • 58min

Exploring The Evolving Role Of Data Engineers

Summary Data Engineering is still a relatively new field that is going through a continued evolution as new technologies are introduced and new requirements are understood. In this episode Maxime Beauchemin returns to revisit what it means to be a data engineer and how the role has changed over the past 5 years. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey and today I’m interviewing Maxime Beauchemin about the impacts that the evolution of the modern data stack has had on the role and responsibilities of data engineers Interview Introduction How did you get involved in the area of data management? What is your current working definition of a data engineer? How has that definition changed since your article on the "rise of the data engineer" and episode 3 of this show about "defining data engineering"? How has the growing availability of data infrastructure services shifted foundational skills and knowledge that are necessary to be effective? How should a new/aspiring data engineer focus their time and energy to become effective? One of the core themes in this current spate of technologies is "democratization of data". In your post on the downfall of the data engineer you called out the pressure on data engineers to maintain control with so many contributors with varying levels of skill and understanding. How well is the "modern data stack" balancing these concerns? An interesting impact of the growing usage of data is the constrained availability of data engineers. How do you see the effects of the job market on driving evolution of tooling and services? With the explosion of tools and services for working with data, a new problem has evolved of which ones to use for a given organization. What do you see as an effective and efficient process for enumerating and evaluating the available components for building a stack? There is also a lot of conversation around the "modern data stack", as well as the need for companies to build a "data platform". What (if any) difference do you see in the implications of those phrases and the skills required to compile a stack vs build a platform? How do you view the long term viability of templated SQL as a core workflow for transformations? What is the impact of more acessible and widespread machine learning/deep learning on data engineers/data infrastructure? How evenly distributed across industries and geographies are the advances in data infrastructure and engineering practices? What are some of the opportunities that are being missed or squandered during this dramatic shift in the data engineering landscape? What are the most interesting, innovative, or unexpected ways that you have seen the data ecosytem evolve? What are the most interesting, unexpected, or challenging lessons that you have learned while contributing to and participating in the data ecosystem? In episode 3 of this show (almost five years ago) we closed with some predictions for the following years of data engineering, many of which have been proven out. What is your retrospective on those claims, and what are your new predictions for the upcoming years? Contact Info LinkedIn @mistercrunch on Twitter mistercrunch on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links How the Modern Data Stack is Reshaping Data Engineering The Rise of the Data Engineer The Downfall of the Data Engineer Defining Data Engineering – Data Engineering Podcast Airflow Superset Podcast Episode Preset Fivetran Podcast Episode Meltano Podcast Episode Airbyte Podcast Episode Ralph Kimball Bill Inmon Feature Store Prophecy.io Podcast Episode Ab Initio Dremio Podcast Episode Data Mesh Podcast Episode Firebolt Podcast Episode The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
undefined
Dec 21, 2021 • 55min

Fast And Flexible Headless Data Analytics With Cube.JS

Summary One of the perennial challenges of data analytics is having a consistent set of definitions, along with a flexible and performant API endpoint for querying them. In this episode Artom Keydunov and Pavel Tiunov share their work on Cube.js and the various ways that it is being used in the open source community. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Your host is Tobias Macey and today I’m interviewing Artyom Keydunov and Pavel Tiunov about Cube.js a framework for building analytics APIs to power your applications and BI dashboards Interview Introduction How did you get involved in the area of data management? Can you describe what Cube is and the story behind it? What are the main use cases and platform architectures that you are focused on? Who are the target personas that will be using and managing Cube.js? The name comes from the concept of an OLAP cube. Can you discuss the applications of OLAP cubes and their role in the current state of the data ecosystem? How does the idea of an OLAP cube compare to the recent focus on a dedicated metrics layer? What are the pieces of a data platform that might be replaced by Cube.js? Can you describe the design and architecture of the Cube platform? How has the focus and target use case for the Cube platform evolved since you first started working on it? One of the perpetually hard problems in computer science is cache management. How have you approached that challenge in the pre-aggregation layer of the Cube framework? What is your overarching design philosophy for the API of the Cube system? Can you talk through the workflow of someone building a cube and querying it from a downstream system? What do the iteration cycles look like as you go from initial proof of concept to a more sophisticated usage of Cube.js? What are some of the data modeling steps that are needed in the source systems? The perennial problem of embedding SQL into another host language or DSL is how to deal with validation and developer tooling. What are the utilities that you and the community have built to reduce friction while writing the definitions of a cube? What are the methods available for maintaining visibility across all of the cubes defined within and across installations of Cube.js? What are the opportunities for composing multiple cubes together to form a higher level aggregation? What are the most interesting, innovative, or unexpected ways that you have seen Cube.js used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube? When is Cube the wrong choice? What do you have planned for the future of Cube? Contact Info Artom keydunov on GitHub @keydunov on Twitter LinkedIn Pavel LinkedIn @paveltiunov87 on Twitter paveltiunov on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Links Cube.js Statsbot chart.js Highcharts D3 OLAP Cube dbt Superset Podcast Episode Streamlit Podcast.__init__ Episode Parquet Hasura kSQLDB Podcast Episode Materialize Podcast Episode Meltano Podcast Episode The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
undefined
Dec 20, 2021 • 1h 6min

Building A System Of Record For Your Organization's Data Ecosystem At Metaphor

Summary Building a well managed data ecosystem for your organization requires a holistic view of all of the producers, consumers, and processors of information. The team at Metaphor are building a fully connected metadata layer to provide both technical and social intelligence about your data. In this episode Pardhu Gunnam and Mars Lan explain how they have designed the architecture and user experience to allow everyone to collaborate on the data lifecycle and provide opportunities for automation and extensible workflows. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey and today I’m interviewing Pardhu Gunnam and Mars Lan about Metaphor Data, a platform aiming to be the system of record for your data ecosystem Interview Introduction How did you get involved in the area of data management? Can you describe what Metaphor is and the story behind it? On your site it states that you are aiming to be the "system of record" for your data platform. Can you unpack that statement and its implications? What are the shortcomings in the "data catalog" approach to metadata collection and presentation? Who are the target end users of Metaphor and what are the pain points for each persona that you are prioritizing? How has that focus informed your priorities for user experience design and feature development? Can you describe how the Metaphor platform is architected? What are the lessons that you learned from your work at DataHub that have informed your work on Metaphor? There has been a huge amount of focus on the "modern data stack" with an assumption that there is a cloud data warehouse as the central component that all data flows through. How does Metaphor’s design allow for usage in platforms that aren’t dominated by a cloud data warehouse? What are some examples of information that you can extract through integrations with an organization’s communication platforms? Can you talk through a few example workflows where that information is used to inform the actions taken by a team member? What is your philosophy around data modeling or schema standardization for metadata records? What are some of the challenges that teams face in stitching together a meaningful set of relations across metadata records in Metaphor? What are some of the features or potential use cases for Metaphor that are overlooked or misunderstood as you work with your customers? What are the most interesting, innovative, or unexpected ways that you have seen Metaphor used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metaphor? When is Metaphor the wrong choice? What do you have planned for the future of Metaphor? Contact Info Pardhu LinkedIn @PardhuGunnam on Twitter Mars LinkedIn mars-lan on GitHub @mars_lan on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Links Metaphor The Modern Metadata Platform Why cant I find the right data? DataHub Transform Podcast Episode Supergrain MetriQL Podcast Episode dbt Podcast Interview OpenMetadata Podcast Interview Pegasus Data Language Modern Data Experience The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
undefined
Dec 13, 2021 • 42min

Building Auditable Spark Pipelines At Capital One

Summary Spark is a powerful and battle tested framework for building highly scalable data pipelines. Because of its proven ability to handle large volumes of data Capital One has invested in it for their business needs. In this episode Gokul Prabagaren shares his use for it in calculating your rewards points, including the auditing requirements and how he designed his pipeline to maintain all of the necessary information through a pattern of data enrichment. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Your host is Tobias Macey and today I’m interviewing Gokul Prabagaren about how he is using Spark for real-world workflows at Capital One Interview Introduction How did you get involved in the area of data management? Can you start by giving an overview of the types of data and workflows that you are responsible for at Capital one? In terms of the three "V"s (Volume, Variety, Velocity), what is the magnitude of the data that you are working with? What are some of the business and regulatory requirements that have to be factored into the solutions that you design? Who are the consumers of the data assets that you are producing? Can you describe the technical elements of the platform that you use for managing your data pipelines? What are the various ways that you are using Spark at Capital One? You wrote a post and presented at the Databricks conference about your experience moving from a data filtering to a data enrichment pattern for segmenting transactions. Can you give some context as to the use case and what your design process was for the initial implementation? What were the shortcomings to that approach/business requirements which led you to refactoring the approach to one that maintained all of the data through the different processing stages? What are some of the impacts on data volumes and processing latencies working with enriched data frames persisted between task steps? What are some of the other optimizations or improvements that you have made to that pipeline since you wrote the post? What are some of the limitations of Spark that you have experienced during your work at Capital One? How have you worked around them? What are the most interesting, innovative, or unexpected ways that you have seen Spark used at Capital One? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data engineering at Capital One? What are some of the upcoming projects that you are focused on/excited for? How has your experience with the filtering vs. enrichment approach influenced your thinking on other projects that you work on? Contact Info @gocool_p on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Links Apache Spark Blog Post Databricks Presentation Delta Lake Podcast Episode The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
undefined
Dec 12, 2021 • 58min

Deliver Personal Experiences In Your Applications With The Unomi Open Source Customer Data Platform

Summary The core to providing your users with excellent service is to understand them and provide a personalized experience. Unfortunately many sites and applications take that to the extreme and collect too much information. In order to make it easier for developers to build customer profiles in a way that respects their privacy Serge Huber helped to create the Apache Unomi framework as an open source customer data platform. In this episode he explains how it can be used to build rich and useful profiles of your users, the system architecture that powers it, and some of the ways that it is being integrated into an organization’s broader data ecosystem. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey and today I’m interviewing Serge Huber about Apache Unomi, an open source customer data platform designed to manage customers, leads and visitors data and help personalize customers experiences Interview Introduction How did you get involved in the area of data management? Can you describe what Unomi is and the story behind it? What are the goals and target use cases of Unomi? What are the aspects of collecting and aggregating profile information that present challenges to developers? How does the design of Unomi reduce that burden? How does the focus of Unomi compare to systems such as Segment/Rudderstack or Optimizely for collecting user interactions and applying personalization? How does Unomi fit in the architecture of an application or data infrastructure? Can you describe how Unomi itself is architected? How have the goals and design of the project changed or evolved since it started? What are some of the most complex or challenging engineering projects that you have worked through? Can you describe the workflow of using Unomi to manage a set of customer profiles? What are some examples of user experience customization that you can build with Unomi? What are some alternative architectures that you have seen to produce similar capabilities? One of the interesting features of Unomi is the end-user profile management. What are some of the system and developer challenges that are introduced by that capability? (e.g. constraints on data manipulation, security, privacy concerns, etc.) How did Unomi manage privacy concerns and the GDPR ? How does Unomi help with the new third party data restrictions ? Why is access to raw data so important ? Could cloud providers offer Unomi as a service ? How have you used Unomi in your own work? What are the most interesting, innovative, or unexpected ways that you have seen Unomi used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Unomi? When is Unomi the wrong choice? What do you have planned for the future of Unomi? Contact Info LinkedIn @sergehuber on Twitter @bhillou on Twitter sergehuber on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Apache Unomi Jahia OASIS Open Foundation Segment Podcast Episode Rudderstack The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
undefined
Dec 4, 2021 • 50min

Data Driven Hiring For Data Professionals With Alooba

Summary Hiring data professionals is challenging for a multitude of reasons, and as with every interview process there is a potential for bias to creep in. Tim Freestone founded Alooba to provide a more stable reference point for evaluating candidates to ensure that you can make more informed comparisons based on their actual knowledge. In this episode he explains how Alooba got started, how it is being used in the interview process for data oriented roles, and how it can also provide visibility into your organizations overall data literacy. The whole process of hiring is an important organizational skill to cultivate and this is an interesting exploration of the specific challenges involved in finding data professionals. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey and today I’m interviewing Tim Freestone about Alooba, an assessment platform for evaluating data and analytics candidates to improve hiring outcomes for data roles. Interview Introduction How did you get involved in the area of data management? Can you describe what Alooba is and the story behind it? What are the main goals that you are trying to achieve with Alooba? What are the main challenges that employers and candidates face when navigating their respective roles in the hiring process? What are some of the difficulties that are specific to data oriented roles? What are some of the complexities involved in designing a user experience that is positive and productive for both candidates and companies? What are some strategies that you have developed for establishing a fair and consistent baseline of skills to ensure consistent comparison across candidates? One of the problems that comes from test-based skills assessment is the implicit bias toward candidates who test well. How do you work to mitigate that in the candidate evaluation process? Can you describe how the Alooba platform itself is implemented? How have the goals and design of the system changed or evolved since you first started it? What are some of the ways that you use Alooba internally? How do you stay up to date with the evolving skill requirements as roles change and new roles are created? Beyond evaluation of candidates for hiring, what are some of the other features that you have added to Alooba to support organizations in their effort to gain value from their data? What are the most interesting, innovative, or unexpected ways that you have seen Alooba used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Alooba? When is Alooba the wrong choice? What do you have planned for the future of Alooba? Contact Info LinkedIn @timmyfreestone on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Alooba The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
undefined
Dec 4, 2021 • 58min

Experimentation and A/B Testing For Modern Data Teams With Eppo

Summary A/B testing and experimentation are the most reliable way to determine whether a change to your product will have the desired effect on your business. Unfortunately, being able to design, deploy, and validate experiments is a complex process that requires a mix of technical capacity and organizational involvement which is hard to come by. Chetan Sharma founded Eppo to provide a system that organizations of every scale can use to reduce the burden of managing experiments so that you can focus on improving your business. In this episode he digs into the technical, statistical, and design requirements for running effective experiments and how he has architected the Eppo platform to make the process more accessible to business and data professionals. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Your host is Tobias Macey and today I’m interviewing Chetan Sharma about Eppo, a platform for building A/B experiments that are easier to manage Interview Introduction How did you get involved in the area of data management? Can you describe what Eppo is and the story behind it? What are some examples of the kinds of experiments that teams and organizations might want to conduct? What are the points of friction that What are the steps involved in designing, deploying, and analyzing the outcomes of an A/B experiment? What are some of the statistical errors that are common when conducting an experiment? What are the design and UX principles that you have focused on in Eppo to improve the workflow of building and analyzing experiments? Can you describe the system design of the Eppo platform? What are the services or capabilities external to Eppo that are required for it to be effective? What are the integration points for adding Eppo to an organization’s existing platform? Beyond the technical capabilities for running experiments there are a number of design requirements involved. Can you talk through some of the decisions that need to be made when deciding what to change and how to measure its impact? Another difficult element of managing experiments is understanding how they all interact with each other when running a large number of simultaneous tests. How does Eppo help with tracking the various experiments and the cohorts that are bucketed into each? What are some of the ideas or assumptions that you had about the technical and design aspects of running experiments that have been challenged or changed while building Eppo? What are the most interesting, innovative, or unexpected ways that you have seen Eppo used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Eppo? When is Eppo the wrong choice? What do you have planned for the future of Eppo? Contact Info LinkedIn @chesharma87 on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Eppo Knowledge Repo Apache Hive Frequentist Statistics Rudderstack The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Remember Everything You Learn from Podcasts

Save insights instantly, chat with episodes, and build lasting knowledge - all powered by AI.
App store bannerPlay store banner
Get the app