Data Engineering Podcast

Tobias Macey
Feb 23, 2021 • 52min

Self Service Open Source Data Integration With Airbyte

Summary

Data integration is a critical piece of every data pipeline, yet it is still far from being a solved problem. There are a number of managed platforms available, but the list of options for an open source system that supports a large variety of sources and destinations is still embarrassingly short. The team at Airbyte is adding a new entry to that list with the goal of making robust and easy to use data integration more accessible to teams who want or need to maintain full control of their data. In this episode co-founders John Lafleur and Michel Tricot share the story of how and why they created Airbyte, discuss the project’s design and architecture, and explain their vision of what an open source data integration platform should offer. If you are struggling to maintain your extract and load pipelines or spending time on integrating with a new system when you would prefer to be working on other projects then this is definitely a conversation worth listening to.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.

Your host is Tobias Macey and today I’m interviewing Michel Tricot and John Lafleur about Airbyte, an open source framework for building data integration pipelines.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Airbyte is and the story behind it?
- Businesses and data engineers have a variety of options for how to manage their data integration. How would you characterize the overall landscape and how does Airbyte distinguish itself in that space?
- How would you characterize your target users? How have those personas instructed the priorities and design of Airbyte?
- What do you see as the benefits and tradeoffs of a UI oriented data integration platform as compared to a code first approach?
- What are the complex/challenging elements of data integration that make it such a slippery problem?
- What was the motivation for creating open source ELT as a business?
- Can you describe how the Airbyte platform is implemented? What was your motivation for choosing Java as the primary language?
- What incidental complexity comes from forcing all connectors to be packaged as containers?
- What are the shortcomings of the Singer specification, and what was the motivation for creating a backwards incompatible interface?
- What is the perceived potential for community adoption of the Airbyte specification?
- What are the tradeoffs of using JSON as the interchange format vs. e.g. protobuf/gRPC/Avro/etc.? (see the protocol sketch after these notes)
- What information is lost when converting records to JSON types, and how can that information be preserved (e.g. field constraints, valid enums, etc.)?
- What interfaces/extension points exist for integrating with other tools, e.g. Dagster?
- What abstraction layers are available for simplifying the implementation of new connectors?
- What are the tradeoffs of storing all connectors in a monorepo with the Airbyte core?
- What has been the impact of community adoption/contributions?
- What is involved in setting up an Airbyte installation?
- What are the available axes for scaling an Airbyte deployment?
- What are the challenges of setting up and maintaining a CI environment for Airbyte?
- How are you managing governance and long term sustainability of the project?
- What are some of the most interesting, unexpected, or innovative ways that you have seen Airbyte used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Airbyte?
- When is Airbyte the wrong choice?
- What do you have planned for the future of the project?

Contact Info
- Michel: LinkedIn, @MichelTricot on Twitter, michel-tricot on GitHub
- John: LinkedIn, @JeanLafleur on Twitter, johnlafleur on GitHub

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Airbyte, Liveramp, Fivetran (Podcast Episode), Stitch Data, Matillion, DataCoral (Podcast Episode), Singer, Meltano (Podcast Episode), Airflow (Podcast.__init__ Episode), Kotlin, Docker, Monorepo, Airbyte Specification, Great Expectations (Podcast Episode), Dagster (Data Engineering Podcast Episode, Podcast.__init__ Episode), Prefect (Podcast Episode), DBT (Podcast Episode), Kubernetes, Snowflake (Podcast Episode), Redshift, Presto, Spark, Parquet (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
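The interview digs into the Airbyte specification and the tradeoffs of JSON as the interchange format. As a rough, simplified illustration of how that style of protocol works (not a verbatim copy of the published spec), here is a minimal Python source connector that emits newline-delimited RECORD and STATE messages on stdout, the way Airbyte-style and Singer-style connectors communicate with the platform that orchestrates them:

```python
import json
import sys
import time

def emit(message: dict) -> None:
    # Connectors write one JSON document per line to stdout; the
    # orchestrating platform reads and routes these messages.
    sys.stdout.write(json.dumps(message) + "\n")
    sys.stdout.flush()

def read_stream(stream_name: str, rows: list) -> None:
    for row in rows:
        emit({
            "type": "RECORD",
            "record": {
                "stream": stream_name,
                "data": row,  # arbitrary JSON object; type fidelity is bounded by JSON
                "emitted_at": int(time.time() * 1000),
            },
        })
    # A STATE message checkpoints progress so an interrupted sync can
    # resume instead of starting from scratch.
    emit({"type": "STATE", "state": {"data": {stream_name: {"cursor": len(rows)}}}})

if __name__ == "__main__":
    read_stream("users", [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": "b@example.com"},
    ])
```

Because every message is plain JSON on a standard stream, a connector can be written in any language and packaged as an opaque container, which is exactly the tradeoff against richer typed formats like protobuf or Avro that the conversation explores.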
Feb 16, 2021 • 52min

Building The Foundations For Data Driven Businesses at 5xData

Tarush Aggarwal, Founder of 5xData, discusses the core elements necessary for businesses to be data-driven. He emphasizes the importance of building foundational capabilities, offering collaborative workshops to assist in setting up technical and organizational systems. He also highlights the ongoing support provided through mastermind groups and the initial steps for making data-informed decisions.
Feb 9, 2021 • 47min

How Shopify Is Building Their Production Data Warehouse Using DBT

Summary

With all of the tools and services available for building a data platform it can be difficult to separate the signal from the noise. One of the best ways to get a true understanding of how a technology works in practice is to hear from people who are running it in production. In this episode Zeeshan Qureshi and Michelle Ark share their experiences using DBT to manage the data warehouse for Shopify. They explain how they structured the project to allow multiple teams to collaborate in a scalable manner, the additional tooling that they added to address the edge cases that they have run into, and the optimizations that they baked into their continuous integration process to provide fast feedback and reduce costs. This is a great conversation about the lessons learned from real world use of a specific technology and how well it lives up to its promises.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

Today’s episode of Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring.

Your host is Tobias Macey and today I’m interviewing Zeeshan Qureshi and Michelle Ark about how Shopify is building their production data warehouse platform with DBT.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what the Shopify platform is?
- What kinds of data sources are you working with?
- Can you share some examples of the types of analysis, decisions, and products that you are building with the data that you manage?
- How have you structured your data teams to be able to deliver those projects?
- What are the systems that you have in place, technological or otherwise, to allow you to support the needs of the various data professionals and business users?
- What was the tipping point that led you to reconsider your system design and start down the road of architecting a data warehouse?
- What were your criteria when selecting a platform for your data warehouse? What decision did those criteria lead you to make?
- Once you decided to orient a large portion of your reporting around a data warehouse, what were the biggest unknowns that you were faced with while deciding how to structure the workflows and access policies?
- What were your criteria for determining what toolchain to use for managing the data warehouse?
- You ultimately decided to standardize on DBT. What were the other options that you explored and what were the requirements that you had for determining the candidates?
- What was your process for onboarding users into the DBT toolchain and determining how to structure the project layout?
- What are some of the shortcomings or edge cases that you ran into?
- Rather than rely on the vanilla DBT workflow you created a wrapper project to add additional functionality. What were some of the features that you needed to add to suit your particular needs?
- What has been your experience with extending and integrating with DBT to customize it for your environment?
- Can you talk through how you manage testing of your DBT pipelines and the tables that it is responsible for?
- How much of the testing are you able to do with out-of-the-box functionality from DBT?
- What are the additional capabilities that you have bolted on to provide a more robust and scalable means of verifying your pipeline changes?
- Can you share how you manage the CI/CD process for changes in your data warehouse? (see the CI sketch after these notes)
- What kinds of monitoring or metrics collection do you perform on the execution of your DBT pipelines?
- How do you integrate the management of your data warehouse and DBT workflows with your broader data platform?
- Now that you have been using DBT in production for a while, what are the challenges that you have encountered when using it at scale?
- Are there any patterns that you and your team have found useful that are worth digging into for other teams who are considering DBT or are actively using it?
- What are the opportunities and available mechanisms that you have found for introducing abstraction layers to reduce the maintenance burden for your data warehouse?
- What is the data modeling approach that you are using? (e.g. Data Vault, Star/Snowflake Schema, wide tables, etc.)
- As you continue to work with DBT and rely on the data warehouse for production use cases, what are some of the additional features/improvements that you have planned?
- What are some of the unexpected/innovative/surprising use cases that you and your team have found for the Seamster tool or the data models that it generates?
- What are the cases where you think that DBT or data warehousing is the wrong answer and teams should be looking to other solutions?
- What are the most interesting, unexpected, or challenging lessons that you learned while working through the process of migrating a portion of your data workloads into the data warehouse and managing them with DBT?

Contact Info
- Zeeshan: @zeeshanq on Twitter, Website
- Michelle: @michellearky on Twitter, LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
How to Build a Production Grade Workflow with SQL Modelling, Shopify, JRuby, PySpark, Druid, Amplitude, Mode, Snowflake Schema, Data Vault (Podcast Episode), BigQuery, Amazon Redshift, CI/CD, Great Expectations (Podcast Episode), Master Data Management (Podcast Episode), Flink SQL

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
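The CI/CD discussion above centers on building and testing only what changed to keep feedback fast and warehouse costs down. Here is a hedged sketch of that general pattern in Python (this is not Shopify’s Seamster tooling, and the artifact path is an assumption); the `state:modified+` selector with `--defer` and `--state` exists in modern dbt releases:

```python
import subprocess

# Where the last production run's manifest.json lives; in a real CI job
# this would be downloaded from artifact storage (an assumption here).
PROD_ARTIFACTS = "./prod-run-artifacts"

def run(cmd: list) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def ci_build() -> None:
    # Build only models whose definitions changed, plus downstream
    # dependents; defer references to unchanged models to production
    # so CI does not rebuild the entire DAG.
    run(["dbt", "run", "--select", "state:modified+",
         "--defer", "--state", PROD_ARTIFACTS])
    # Test the same slice for fast feedback on the change.
    run(["dbt", "test", "--select", "state:modified+",
         "--defer", "--state", PROD_ARTIFACTS])

if __name__ == "__main__":
    ci_build()
```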
Feb 2, 2021 • 1h 5min

System Observability For The Cloud Native Era With Chronosphere

Summary

Collecting and processing metrics for monitoring use cases is an interesting data problem. It is eminently possible to generate millions or billions of data points per second, the information needs to be propagated to a central location, processed, and analyzed in timeframes on the order of milliseconds or single-digit seconds, and the consumers of the data need to be able to query the information quickly and flexibly. As the systems that we build continue to grow in scale and complexity the need for reliable and manageable monitoring platforms increases proportionately. In this episode Rob Skillington, CTO of Chronosphere, shares his experiences building metrics systems that provide observability to companies that are operating at extreme scale. He describes how the M3DB storage engine is designed to manage the pressures of a critical system component, the inherent complexities of working with telemetry data, and the motivating factors that are contributing to the growing need for flexibility in querying the collected metrics. This is a fascinating conversation about an area of data management that is often taken for granted.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

Today’s episode of Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent.
Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring.

Your host is Tobias Macey and today I’m interviewing Rob Skillington about Chronosphere, a scalable, reliable and customizable monitoring-as-a-service purpose built for cloud-native applications.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you are building at Chronosphere and your motivation for turning it into a business?
- What are the biggest challenges inherent to monitoring use cases?
- How does the advent of cloud native environments complicate things further?
- While you were at Uber you helped to create the M3 storage engine. There are a wide array of time series databases available, including many purpose built for metrics use cases. What were the missing pieces that made it necessary to create a new system?
- How do you handle schema design/data modeling for metrics storage?
- How do the usage patterns of metrics systems contribute to the complexity of building a storage layer to support them?
- What are the optimizations that need to be made for the read and write paths in M3?
- How do you handle high cardinality of metrics and ad-hoc queries to understand system behaviors? (see the index sketch after these notes)
- What are the scaling factors for M3?
- Can you describe how you have architected the Chronosphere platform?
- What are the convenience features built on top of M3 that you are creating at Chronosphere?
- How do you handle deployment and scaling of your infrastructure given the scale of the businesses that you are working with?
- Beyond just server infrastructure and application behavior, what are some of the other sources of metrics that you and your users are sending into Chronosphere?
- How do those alternative metrics sources complicate the work of generating useful insights from the data?
- In addition to the read and write loads, metrics systems also need to be able to identify patterns, thresholds, and anomalies in the data to alert on it with minimal latency. How do you handle that in the Chronosphere platform?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Chronosphere/M3 used?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building Chronosphere?
- When is Chronosphere the wrong choice?
- What do you have planned for the future of the platform and business?

Contact Info
- LinkedIn
- @roskilli on Twitter
- robskillington on GitHub

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Chronosphere, Lidar, Cloud Native, M3DB, OpenTracing, Metrics/Telemetry, Graphite (Podcast.__init__ Episode), InfluxDB, Clickhouse (Podcast Episode), Prometheus, Inverted Index, Druid, Cardinality, Apache Flink (Podcast Episode), HDFS, Avro (Podcast Episode), Grafana, Tecton (Podcast Episode), Datadog (Podcast Episode), Kubernetes, Sourcegraph

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
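A recurring theme above is querying high-cardinality metrics by label. Here is a toy Python sketch of the inverted-index idea that metrics stores such as M3 and Prometheus build on: each (label, value) pair maps to the set of series IDs carrying it, so ad-hoc label queries become set intersections. The class and data are illustrative only, not any engine’s actual structures:

```python
from collections import defaultdict

class LabelIndex:
    """Map each (label, value) pair to the set of series IDs carrying it."""

    def __init__(self) -> None:
        self._postings = defaultdict(set)

    def add_series(self, series_id: int, labels: dict) -> None:
        for pair in labels.items():
            self._postings[pair].add(series_id)

    def query(self, selector: dict) -> set:
        # Intersect the posting sets for every label in the selector,
        # starting from the smallest to keep intermediate sets tiny.
        postings = sorted(
            (self._postings.get(pair, set()) for pair in selector.items()),
            key=len,
        )
        if not postings:
            return set()
        result = set(postings[0])
        for p in postings[1:]:
            result &= p
        return result

index = LabelIndex()
index.add_series(1, {"service": "api", "region": "us-east", "status": "500"})
index.add_series(2, {"service": "api", "region": "eu-west", "status": "200"})
print(index.query({"service": "api", "status": "500"}))  # {1}
```

The hard part at scale, which the interview explores, is that the number of (label, value) pairs grows with cardinality, so real systems must shard, compress, and age out these postings rather than hold them in a single in-memory dictionary.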
Jan 26, 2021 • 34min

Making It Easier To Stick B2B Data Integration Pipelines Together With Hotglue

Summary

Businesses often need to be able to ingest data from their customers in order to power the services that they provide. For each new source that they need to integrate with it is another custom set of ETL tasks that they need to maintain. In order to reduce the friction involved in supporting new data transformations David Molot and Hassan Syyid built the Hotglue platform. In this episode they describe the data integration challenges facing many B2B companies, how their work on the Hotglue platform simplifies their efforts, and how they have designed the platform to make these ETL workloads embeddable and self service for end users.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

This episode of Data Engineering Podcast is sponsored by Datadog, a unified monitoring and analytics platform built for developers, IT operations teams, and businesses in the cloud age. Datadog provides customizable dashboards, log management, and machine-learning-based alerts in one fully-integrated platform so you can seamlessly navigate, pinpoint, and resolve performance issues in context. Monitor all your databases, cloud services, containers, and serverless functions in one place with Datadog’s 400+ vendor-backed integrations. If an outage occurs, Datadog provides seamless navigation between your logs, infrastructure metrics, and application traces in just a few clicks to minimize downtime. Try it yourself today by starting a free 14-day trial and receive a Datadog t-shirt after installing the agent. Go to dataengineeringpodcast.com/datadog today to see how you can enhance visibility into your stack with Datadog.

Your host is Tobias Macey and today I’m interviewing David Molot and Hassan Syyid about Hotglue, an embeddable data integration tool for B2B developers built on the Python ecosystem.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you are building at Hotglue?
- What was your motivation for starting a business to address this particular problem?
- Who is the target user of Hotglue and what are their biggest data problems?
- What are the types and sources of data that they are likely to be working with?
- How are they currently handling solutions for those problems?
- How does the introduction of Hotglue simplify or improve their work?
- What is involved in getting Hotglue integrated into a given customer’s environment?
- How is Hotglue itself implemented?
- How has the design or goals of the platform evolved since you first began building it?
- What were some of the initial assumptions that you had at the outset and how well have they held up as you progressed?
- Once a customer has set up Hotglue what is their workflow for building and executing an ETL workflow?
- What are their options for working with sources that aren’t supported out of the box? (see the Singer pipeline sketch after these notes)
- What are the biggest design and implementation challenges that you are facing given the need for your product to be embedded in customer platforms and exposed to their end users?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Hotglue used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Hotglue?
- When is Hotglue the wrong choice?
- What do you have planned for the future of the product?

Contact Info
- David: @davidmolot on Twitter, LinkedIn
- Hassan: hsyyid on GitHub, LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Hotglue, Python, The Python Podcast.__init__, B2B == Business to Business, Meltano (Podcast Episode), Airbyte, Singer

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
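Hotglue builds on the Python/Singer ecosystem linked above, where pipelines compose by piping a tap’s stdout into a target’s stdin. Here is a minimal sketch of that glue; the connector names (`tap-exampleapi`, `target-examplewarehouse`) are hypothetical stand-ins for real Singer packages:

```python
import subprocess

# A tap extracts from a source API and writes Singer JSON messages to
# stdout; a target reads them from stdin and loads the destination.
tap = subprocess.Popen(
    ["tap-exampleapi", "--config", "tap_config.json"],
    stdout=subprocess.PIPE,
)
target = subprocess.Popen(
    ["target-examplewarehouse", "--config", "target_config.json"],
    stdin=tap.stdout,
    stdout=subprocess.PIPE,
)
tap.stdout.close()  # let the tap receive SIGPIPE if the target exits early

# Targets emit their final STATE on stdout; persisting the last one
# lets the next run resume incrementally from where this one stopped.
output = target.communicate()[0].decode().strip().splitlines()
if output:
    with open("state.json", "w") as f:
        f.write(output[-1])
```

The appeal of this composition model for an embeddable product is that adding an unsupported source only requires writing (or generating) one more tap that speaks the same message stream.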
Jan 19, 2021 • 60min

Using Your Data Warehouse As The Source Of Truth For Customer Data With Hightouch

Summary

The data warehouse has become the central component of the modern data stack. Building on this pattern, the team at Hightouch have created a platform that synchronizes information about your customers out to third party systems for use by marketing and sales teams. In this episode Tejas Manohar explains the benefits of sourcing customer data from one location for all of your organization to use, the technical challenges of synchronizing the data to external systems with varying APIs, and the workflow for enabling self-service access to your customer data by your marketing teams. This is an interesting conversation about the importance of the data warehouse and how it can be used beyond just internal analytics.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

This episode of Data Engineering Podcast is sponsored by Datadog, a unified monitoring and analytics platform built for developers, IT operations teams, and businesses in the cloud age. Datadog provides customizable dashboards, log management, and machine-learning-based alerts in one fully-integrated platform so you can seamlessly navigate, pinpoint, and resolve performance issues in context. Monitor all your databases, cloud services, containers, and serverless functions in one place with Datadog’s 400+ vendor-backed integrations. If an outage occurs, Datadog provides seamless navigation between your logs, infrastructure metrics, and application traces in just a few clicks to minimize downtime. Try it yourself today by starting a free 14-day trial and receive a Datadog t-shirt after installing the agent. Go to dataengineeringpodcast.com/datadog today to see how you can enhance visibility into your stack with Datadog.
Your host is Tobias Macey and today I’m interviewing Tejas Manohar about Hightouch, a data platform that helps you sync your customer data from your data warehouse to your CRM, marketing, and support tools.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what you are building at Hightouch and your motivation for creating it?
- What are the main points of friction for teams who are trying to make use of customer data?
- Where is Hightouch positioned in the ecosystem of customer data tools such as Segment, Mixpanel, Amplitude, etc.?
- Who are the target users of Hightouch? How has that influenced the design of the platform?
- What are the baseline attributes necessary for Hightouch to populate downstream systems?
- What are the data modeling considerations that users need to be aware of when sending data to other platforms?
- Can you describe how Hightouch is architected?
- How has the design of the platform evolved since you first began working on it?
- What goals or assumptions did you have when you first began building Hightouch that have been modified or invalidated once you began working with customers?
- Can you talk through the workflow of using Hightouch to propagate data to other platforms?
- How do you keep data up to date between the warehouse and downstream systems? (see the diffing sketch after these notes)
- What are the upstream systems that users need to have in place to make Hightouch a viable and effective tool?
- What are the benefits of using the data warehouse as the source of truth for downstream services?
- What are the trends in data warehousing that you are keeping a close eye on? What are you most excited for? Are there any that you find worrisome?
- What are some of the most interesting, unexpected, or innovative ways that you have seen Hightouch used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Hightouch?
- When is Hightouch the wrong choice?
- What do you have planned for the future of the platform?

Contact Info
- LinkedIn
- @tejasmanohar on Twitter
- tejasmanohar on GitHub

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Hightouch, Segment (Podcast Episode), DBT (Podcast Episode), Looker (Podcast Episode), Change Data Capture (Podcast Episode), Database Trigger, Materialize (Podcast Episode), Flink (Podcast Episode), Zapier

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
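A central question above is keeping downstream tools in sync with the warehouse. One common reverse-ETL approach, sketched here in Python under the assumption of a hash-diff strategy (illustrative, not Hightouch’s documented implementation), is to snapshot the query result, diff row hashes against the previous run, and push only the changes through the rate-limited third-party API:

```python
import hashlib
import json

def row_digest(row: dict) -> str:
    # Stable hash of the row contents for cheap change detection.
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

def plan_sync(current_rows: list, previous_digests: dict, key: str = "email"):
    """Return rows that are new or changed since the previous run."""
    changed = []
    for row in current_rows:
        k = str(row[key])
        digest = row_digest(row)
        if previous_digests.get(k) != digest:
            changed.append(row)
        previous_digests[k] = digest
    return changed, previous_digests

def upsert_to_crm(rows: list) -> None:
    # Placeholder for a rate-limited call to a CRM API such as Salesforce.
    for row in rows:
        print("UPSERT", row)

warehouse_rows = [
    {"email": "a@example.com", "plan": "pro"},
    {"email": "b@example.com", "plan": "free"},
]
changed, state = plan_sync(warehouse_rows, previous_digests={})
upsert_to_crm(changed)  # only the diff crosses the API boundary
```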
Jan 11, 2021 • 58min

Enabling Version Controlled Data Collaboration With TerminusDB

Summary

As data professionals we have a number of tools available for storing, processing, and analyzing data. We also have tools for collaborating on software and analysis, but collaborating on data is still an underserved capability. Gavin Mendel-Gleason encountered this problem first hand while working on the Sesshat databank, leading him to create TerminusDB and TerminusHub. In this episode he explains how the TerminusDB system is architected to provide a versioned graph storage engine that allows for branching and merging of data sets, and how that opens up new possibilities for individuals and teams to work together on building new data repositories. This is a fascinating conversation on the technical challenges involved, the opportunities that such a system provides, and the complexities inherent to building a successful business on open source.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.

You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!
Your host is Tobias Macey and today I’m interviewing Gavin Mendel-Gleason about TerminusDB, an open source model driven graph database for knowledge graph representation.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what TerminusDB is and what motivated you to build it?
- What are the use cases that TerminusDB and TerminusHub are designed for?
- There are a number of different reasons and methods for versioning data, such as the work being done with Datomic, LakeFS, DVC, etc. Where does TerminusDB fit in relation to those and other data versioning systems that are available today?
- Can you describe how TerminusDB is implemented? How has the design changed or evolved since you first began working on it?
- What was the decision process and design considerations that led you to choose Prolog as the implementation language?
- One of the challenges that have faced other knowledge engines built around RDF is that of scale and performance. How are you addressing those difficulties in TerminusDB?
- What are the scaling factors and limitations for TerminusDB? (e.g. volumes of data, clustering, etc.)
- How does the use of RDF triples and JSON-LD impact the audience for TerminusDB?
- How much overhead is incurred by maintaining a long history of changes for a database? How do you handle garbage collection/compaction of versions?
- How does the availability of branching and merging strategies change the approach that data teams take when working on a project? (see the branching sketch after these notes)
- What are the edge cases in merging and conflict resolution, and what tools does TerminusDB/TerminusHub provide for working through those situations?
- What are some useful strategies that teams should be aware of for working effectively with collaborative datasets in TerminusDB?
- Another interesting element of the TerminusDB platform is the query language. What did you use as inspiration for designing it and how much of a learning curve is involved?
- What are some of the most interesting, innovative, or unexpected ways that you have seen TerminusDB used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building TerminusDB and TerminusHub?
- When is TerminusDB the wrong choice?
- What do you have planned for the future of the project?

Contact Info
- @GavinMGleason on Twitter
- LinkedIn
- GavinMendelGleason on GitHub

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
TerminusDB, TerminusHub, Chem Informatics, Type Theory, Graph Database, Trinity College Dublin, Sesshat Databank (analytics over civilizations in history), PostgreSQL, DGraph, Grakn, Neo4J, Datomic, LakeFS, DVC, Dolt, Persistent Succinct Data Structure, Currying, Prolog, WOQL (TerminusDB query language), RDF, JSON-LD, Semantic Web, Property Graph, Hypergraph, Super Node, Bloom Filters, Data Curation (Podcast Episode), CRDT == Conflict-Free Replicated Data Types (Podcast Episode), SPARQL, Datalog, AST == Abstract Syntax Tree

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
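To make the branching and merging questions above concrete, here is a conceptual Python sketch of a versioned triple store in which each commit is an immutable set of added and removed triples and a branch is a list of commit ids. TerminusDB’s actual storage (persistent succinct data structures, as linked above) is far more sophisticated; this only illustrates the collaboration model:

```python
class VersionedGraph:
    """Toy branchable triple store: commits are (adds, deletes) sets."""

    def __init__(self) -> None:
        self.commits = []             # list of (adds, deletes) tuples
        self.branches = {"main": []}  # branch name -> list of commit ids

    def commit(self, branch: str, adds: set, deletes: set = None) -> None:
        self.commits.append((set(adds), set(deletes or ())))
        self.branches[branch].append(len(self.commits) - 1)

    def branch(self, name: str, source: str = "main") -> None:
        # A new branch is just a copy of the source's commit history.
        self.branches[name] = list(self.branches[source])

    def materialize(self, branch: str) -> set:
        # Replay the commit chain to produce the current set of triples.
        state = set()
        for cid in self.branches[branch]:
            adds, deletes = self.commits[cid]
            state = (state - deletes) | adds
        return state

    def merge(self, source: str, into: str) -> None:
        # Naive merge: append commits unique to `source`. A real system
        # must also detect conflicting add/delete pairs and surface
        # them for resolution.
        for cid in self.branches[source]:
            if cid not in self.branches[into]:
                self.branches[into].append(cid)

g = VersionedGraph()
g.commit("main", {("alice", "knows", "bob")})
g.branch("experiment")
g.commit("experiment", {("bob", "knows", "carol")})
g.merge("experiment", into="main")
print(g.materialize("main"))  # both triples visible on main
```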
Jan 5, 2021 • 48min

Bringing Feature Stores and MLOps to the Enterprise at Tecton

Kevin Stumpf, Co-founder and CTO of Tecton, discusses the evolution of feature stores and their essential role in modern machine learning ops. He shares insights from his experience with Uber's Michelangelo platform and explains how Tecton simplifies feature creation for data scientists. Topics include the architecture of Tecton, the importance of observability in data management, and the challenges of integrating machine learning workflows. Stumpf also touches on the balance between open-source and enterprise solutions in the ever-evolving data landscape.
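A core mechanism behind the feature-store workflows discussed here is the point-in-time join that keeps future information from leaking into training data: each training example may only see feature values as of its own event time. A small pandas sketch of that idea (illustrative data, not Tecton’s API):

```python
import pandas as pd

# Feature values computed over time, one row per (user, timestamp).
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(["2020-12-01", "2020-12-15", "2020-12-10"]),
    "purchases_30d": [3, 5, 1],
}).sort_values("ts")

# Labeled training events with their own timestamps.
events = pd.DataFrame({
    "user_id": [1, 2],
    "ts": pd.to_datetime(["2020-12-20", "2020-12-12"]),
    "label": [1, 0],
}).sort_values("ts")

# For each event, take the most recent feature row at or before the
# event time: point-in-time correctness, so no future data leaks into
# the training set.
training = pd.merge_asof(events, features, on="ts", by="user_id",
                         direction="backward")
print(training)
```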
Dec 28, 2020 • 34min

Off The Shelf Data Governance With Satori

Summary

One of the core responsibilities of data engineers is to manage the security of the information that they process. The team at Satori has a background in cybersecurity and they are using the lessons that they learned in that field to address the challenge of access control and auditing for data governance. In this episode co-founder and CTO Yoav Cohen explains how the Satori platform provides a proxy layer for your data, the challenges of managing security across disparate storage systems, and their approach to building a dynamic data catalog based on the records that your organization is actually using. This is an interesting conversation about the intersection of data and security and the lessons that can be learned in each direction.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Your host is Tobias Macey and today I’m interviewing Yoav Cohen about Satori, a data access service to monitor, classify and control access to sensitive data.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you have built at Satori? What is the story behind the product and company?
- How does Satori compare to other tools and products for managing access control and governance for data assets?
- What are the biggest challenges that organizations face in establishing and enforcing policies for their data?
- What are the main goals for the Satori product and what use cases does it enable?
- Can you describe how the Satori platform is architected? (see the proxy sketch after these notes) How has the design of the platform evolved since you first began working on it?
- How have your experiences working in cyber security informed your approach to data governance?
- How does the design of the Satori platform simplify technical aspects of data governance? What aspects of governance do you delegate to other systems or platforms?
- What elements of data infrastructure does Satori integrate with?
- For someone who is adopting Satori, what is involved in getting it deployed and set up with their existing data platforms?
- What do you see as being the most complex or underserved aspects of data governance? How much of that complexity is inherent to the problem vs. being a result of how the industry has evolved?
- What are some of the most interesting, innovative, or unexpected ways that you have seen the Satori platform used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Satori?
- When is Satori the wrong choice?
- What do you have planned for the future of the platform?

Contact Info
- LinkedIn
- @yoavcohen on Twitter

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening!
Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Satori, Data Governance, Data Masking, TLS == Transport Layer Security

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
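The episode describes Satori as a proxy layer that classifies and audits data access on the fly. Here is a hedged, self-contained Python sketch of that general idea, with a single email regex standing in for real classification logic; none of this reflects Satori’s actual implementation:

```python
import re

EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def classify(rows: list) -> set:
    """Return column names whose values look like sensitive data."""
    sensitive = set()
    for row in rows:
        for col, val in row.items():
            if isinstance(val, str) and EMAIL.match(val):
                sensitive.add(col)
    return sensitive

def proxy_query(user: str, rows: list, audit_log: list) -> list:
    # The proxy sits between the client and the data store: it tags
    # sensitive columns, records who touched them, and masks values on
    # the way back out.
    tagged = classify(rows)
    audit_log.append({"user": user, "sensitive_columns": sorted(tagged)})
    return [
        {col: ("***" if col in tagged else val) for col, val in row.items()}
        for row in rows
    ]

audit_log = []
result = proxy_query("analyst@corp", [{"id": 1, "contact": "a@example.com"}], audit_log)
print(result)     # contact column masked
print(audit_log)  # which sensitive columns were actually accessed
```

Classifying what flows through the proxy, rather than scanning everything at rest, is also how a dynamic catalog of the data the organization actually uses can emerge as a byproduct of enforcement.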
Dec 21, 2020 • 54min

Low Friction Data Governance With Immuta

Summary

Data governance is a term that encompasses a wide range of responsibilities, both technical and process oriented. One of the more complex aspects is that of access control to the data assets that an organization is responsible for managing. The team at Immuta has built a platform that aims to tackle that problem in a flexible and maintainable fashion so that data teams can easily integrate authorization, data masking, and privacy enhancing technologies into their data infrastructure. In this episode Steve Touw and Stephen Bailey share what they have built at Immuta, how it is implemented, and how it streamlines the workflow for everyone involved in working with sensitive data. If you are starting down the path of implementing a data governance strategy then this episode will provide a great overview of what is involved.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Feature flagging is a simple concept that enables you to ship faster, test in production, and do easy rollbacks without redeploying code. Teams using feature flags release new software with less risk, and release more often. ConfigCat is a feature flag service that lets you easily add flags to your Python code, and 9 other platforms. By adopting ConfigCat you and your manager can track and toggle your feature flags from their visual dashboard without redeploying any code or configuration, including granular targeting rules. You can roll out new features to a subset of your users for beta testing or canary deployments. With their simple API, clear documentation, and pricing that is independent of your team size you can get your first feature flags added in minutes without breaking the bank. Go to dataengineeringpodcast.com/configcat today to get 35% off any paid plan with code DATAENGINEERING or try out their free forever plan.

You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence.
The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!

Your host is Tobias Macey and today I’m interviewing Steve Touw and Stephen Bailey about Immuta and how they work to automate data governance.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you have built at Immuta and your motivation for starting the company?
- What is data governance? How much of data governance can be solved with technology and how much is a matter of process and communication?
- What does the current landscape of data governance solutions look like?
- What are the motivating factors that would lead someone to choose Immuta as a component of their data governance strategy?
- How does Immuta integrate with the broader ecosystem of data tools and platforms?
- What other workflows or activities are necessary outside of Immuta to ensure a comprehensive governance/compliance strategy?
- What are some of the common blind spots when it comes to data governance?
- How is the Immuta platform architected? How have the design and goals of the system evolved since you first started building it?
- What is involved in adopting Immuta for an existing data platform?
- Once an organization has integrated Immuta, what are the workflows for the different stakeholders of the data?
- What are the biggest challenges in automated discovery/identification of sensitive data? How does the evolution of what qualifies as sensitive complicate those efforts?
- How do you approach the challenge of providing a unified interface for access control and auditing across different systems (e.g. BigQuery, Snowflake, RedShift, etc.)?
- What are the complexities that creep into data masking? (see the masking sketch after these notes)
- What are some alternatives for obfuscating and managing access to sensitive information?
- How do you handle managing access control/masking/tagging for derived data sets?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building Immuta?
- When is Immuta the wrong choice?
- What do you have planned for the future of the platform and business?

Contact Info
- Steve: LinkedIn, @steve_touw on Twitter
- Stephen: LinkedIn, Website

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Immuta, Data Governance, Data Catalog, Snowflake DB (Podcast Episode), Looker (Podcast Episode), Collibra, ABAC == Attribute Based Access Control, RBAC == Role Based Access Control, Paul Ohm: Broken Promises of Privacy, PET == Privacy Enhancing Technologies, K Anonymization, Differential Privacy, LDAP == Lightweight Directory Access Protocol, Active Directory, COVID Alliance, HIPAA, GDPR, CCPA

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
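The links above mention ABAC and data masking, which the conversation ties together: policies keyed on column tags and user attributes decide how values are rendered for each reader. A small illustrative Python sketch follows; the policy format is invented for this example and is not Immuta’s policy language:

```python
import hashlib

# Illustrative policies: users lacking the required attribute see a
# transformed value for any column carrying the matching tag.
POLICIES = [
    {"tag": "pii", "require": "pii_approved", "action": "hash"},
]

def apply_policies(user_attrs: set, row: dict, column_tags: dict) -> dict:
    out = dict(row)
    for col, tag in column_tags.items():
        for policy in POLICIES:
            if policy["tag"] == tag and policy["require"] not in user_attrs:
                if policy["action"] == "hash":
                    # Hashing preserves joinability while hiding the value.
                    out[col] = hashlib.sha256(str(row[col]).encode()).hexdigest()[:12]
    return out

row = {"name": "Ada", "email": "ada@example.com"}
tags = {"email": "pii"}
print(apply_policies({"analyst"}, row, tags))       # email masked
print(apply_policies({"pii_approved"}, row, tags))  # email visible
```

Keeping the decision in a policy layer rather than in each query is what lets one rule set apply uniformly across engines like BigQuery, Snowflake, and Redshift, which is the unification challenge the interview explores.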
