Data Engineering Podcast

Latest episodes

May 23, 2022 • 1h 11min

Unlocking The Value Of Data Across The Organization Through User Friendly Data Tools With Prophecy

Summary The interfaces and design cues that a tool offers can have a massive impact on who is able to use it and the tasks that they are able to perform. With an eye to making data workflows more accessible to everyone in an organization Raj Bains and his team at Prophecy designed a powerful and extensible low-code platform that lets technical and non-technical users scale data flows without forcing everyone into the same layers of abstraction. In this episode he explores the tension between code-first and no-code utilities and how he is working to balance the strengths without falling prey to their shortcomings. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. 
If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan. Your host is Tobias Macey and today I’m interviewing Raj Bains about how improving the user experience for data tools can make your work as a data engineer better and easier Interview Introduction How did you get involved in the area of data management? What are the broad categories of data tool designs that are available currently and how does that impact what is possible with them? What are the points of friction that are introduced by the tools? Can you share some of the types of workarounds or wasted effort that are made necessary by those design elements? What are the core design principles that you have built into Prophecy to address these shortcomings? How do those user experience changes improve the quality and speed of work for data engineers? How has the Prophecy platform changed since we last spoke almost a year ago? What are the tradeoffs of low code systems for productivity vs. flexibility and creativity? What are the most interesting, innovative, or unexpected approaches to developer experience that you have seen for data tools? 
What are the most interesting, unexpected, or challenging lessons that you have learned while working on user experience optimization for data tooling at Prophecy? When is it more important to optimize for computational efficiency over developer productivity? What do you have planned for the future of Prophecy?

Contact Info
LinkedIn, @_raj_bains on Twitter

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.

Links
Prophecy (Podcast Episode), CUDA, Clustrix, Hortonworks, Apache Hive, Compilerworks (Podcast Episode), Airflow, Databricks, Fivetran (Podcast Episode), Airbyte (Podcast Episode), Streamsets, Change Data Capture, Apache Pig, Spark, Scala, Ab Initio, Type 2 Slowly Changing Dimensions, AWS Deequ, Matillion (Podcast Episode), Prophecy SaaS

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
May 23, 2022 • 1h 7min

Cloud Native Data Orchestration For Machine Learning And Data Engineering With Flyte

Joining the discussion are Ketan Umare, CEO and co-founder at Union, who initiated Flyte at Lyft, and Haytham Abuelfutuh, Union's CTO, who also built Flyte there. They dive into the complexities of data orchestration in machine learning, comparing traditional tools to Flyte's innovative engine on Kubernetes. The conversation highlights the architectural design for user-friendly workflows and applications of Flyte in diverse sectors, including biotech and gaming. They also discuss the balance between open-source principles and sustainable business models.
May 16, 2022 • 58min

Insights And Advice On Building A Data Lake Platform From Someone Who Learned The Hard Way

Summary Designing a data platform is a complex and iterative undertaking which requires accounting for many conflicting needs. Designing a platform that relies on a data lake as its central architectural tenet adds additional layers of difficulty. Srivatsan Sridharan has had the opportunity to design, build, and run data lake platforms for both Yelp and Robinhood, with many valuable lessons learned from each experience. In this episode he shares his insights and advice on how to approach such an undertaking in your own organization. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl RudderStack helps you build a customer data platform on your warehouse or data lake. 
Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. Your host is Tobias Macey and today I’m interviewing Srivatsan Sridharan about the technological, staffing, and design considerations for building a data platform Interview Introduction How did you get involved in the area of data management? Can you describe what your experience has been with designing and implementing data platforms? What are the elements that you have found to be common requirements across organizations and data characteristics? What are the architectural elements that require the most detailed consideration based on organizational needs and data requirements? 
How has the ecosystem for building maintainable and usable data lakes matured over the past few years? What are the elements that are still cumbersome or intractable? The streaming ecosystem has also gone through substantial changes over the past few years. What is your synopsis of the meaningful differences between today’s options and where we were ~6 years ago? How did your experiences at Yelp inform your current architectural approach at Robinhood? Can you describe your current platform architecture? What are the primary capabilities that you are optimizing for? What is your evaluation process for determining what components to use in your platform? How do you approach the build vs. buy problem and quantify the tradeoffs? What are the most interesting, innovative, or unexpected ways that you have seen your data systems used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on designing and implementing data platforms across your career? When is a data lake architecture the wrong choice? What do you have planned for the future of the data platform at Robinhood?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.

Links
Robinhood, Yelp, Kafka, Spark, Flink (Podcast Episode), Pulsar (Podcast Episode), Parquet, Change Data Capture, Delta Lake (Podcast Episode), Hudi (Podcast Episode), Redshift, BigQuery, Informatica, Data Mesh (Podcast Episode), PrestoDB, Trino, Airbyte (Podcast Episode), Meltano (Podcast Episode), Fivetran (Podcast Episode), Stitch, Pinot (Podcast Episode), Clickhouse (Podcast Episode), Druid, Iceberg (Podcast Episode), Looker (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
May 16, 2022 • 48min

Designing And Deploying IoT Analytics For Industrial Applications At Vopak

Summary Industrial applications are one of the primary adopters of Internet of Things (IoT) technologies, with business critical operations being informed by data collected across a fleet of sensors. Vopak is a business that manages storage and distribution of a variety of liquids that are critical to the modern world, and they have recently launched a new platform to gain more utility from their industrial sensors. In this episode Mário Pereira shares the system design that he and his team have developed for collecting and managing the collection and analysis of sensor data, and how they have split the data processing and business logic responsibilities between physical terminals and edge locations, and centralized storage and compute. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. 
Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan. Your host is Tobias Macey and today I’m interviewing Mário Pereira about building a data management system for globally distributed IoT sensors at Vopak Interview Introduction How did you get involved in the area of data management? Can you describe what Vopak is and what kinds of information you rely on to power the business? What kinds of sensors and edge devices are you using? What kinds of consistency or variance do you have between sensors across your locations? How much computing power and storage space do you place at the edge? What level of pre-processing/filtering is being done at the edge and how do you decide what information needs to be centralized? What are some examples of decision-making that happens at the edge? Can you describe the platform architecture that you have built for collecting and processing sensor data? What was your process for selecting and evaluating the various components? How much tolerance do you have for missed messages/dropped data? 
How long are your data retention periods and what are the factors that influence that policy? What are some statistics related to the volume, variety, and velocity of your data? What are the end-to-end latency requirements for different segments of your data? What kinds of analysis are you performing on the collected data? What are some of the potential ramifications of failures in your system? (e.g. spills, explosions, spoilage, contamination, revenue loss, etc.) What are some of the scaling issues that you have experienced as you brought your system online? How have you been managing the decision making prior to implementing these technology solutions? What are the new capabilities and business processes that are enabled by this new platform? What are the most interesting, innovative, or unexpected ways that you have seen your data capabilities applied? What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an IoT collection and aggregation platform at global scale? What do you have planned for the future of your IoT system?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.

Links
Vopak, Swinging Door Compression Algorithm, IoT Greengrass, OPCUA IoT protocol, MongoDB, AWS Kinesis, AWS Batch, AWS IoT Sitewise Edge, Boston Dynamics

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
May 9, 2022 • 1h 1min

Exploring The Insights And Impact Of Dan Delorey's Distinguished Career In Data

Summary Dan Delorey helped to build the core technologies of Google’s cloud data services for many years before embarking on his latest adventure as the VP of Data at SoFi. From being an early engineer on the Dremel project, to helping launch and manage BigQuery, on to helping enterprises adopt Google’s data products he learned all of the critical details of how to run services used by data platform teams. Now he is the consumer of many of the tools that his work inspired. In this episode he takes a trip down memory lane to weave an interesting and informative narrative about the broader themes throughout his work and their echoes in the modern data ecosystem. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. 
If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan. Your host is Tobias Macey and today I’m interviewing Dan Delorey about his journey through the data ecosystem as the current head of data at SoFi, prior engineering leader with the BigQuery team, and early engineer on Dremel Interview Introduction How did you get involved in the area of data management? Can you start by sharing what your current relationship to the data ecosystem is and the cliffs-notes version of how you ended up there? Dremel was a ground-breaking technology at the time. What do you see as its lasting impression on the landscape of data both in and outside of Google? You were instrumental in crafting the vision behind "querying data in place," (what they called, federated data) at Dremel and BigQuery. What do you mean by this? How has this approach evolved? What are some challenges with this approach? How well did the Drill project capture the core principles of Dremel as outlined in the eponymous white paper? Following your work on Drill you were involved with the development and growth of BigQuery and the broader suite of Google Cloud’s data platform. 
What do you see as the influence that those tools had on the evolution of the broader data ecosystem? How have your experiences at Google influenced your approach to platform and organizational design at SoFi? What’s in SoFi’s data stack? How do you decide what technologies to buy vs. build in-house? How does your team solve for data quality and governance? What are the dominating factors that you consider when deciding on project/product priorities for your team? When you’re not building industry-defining data tooling or leading data strategy, you spend time thinking about the ethics of data. Can you elaborate a bit about your research and interest there? You also have some ideas about data marketplaces, which is a hot topic these days with companies like Snowflake and Databricks breaking into this economy. What’s your take on the evolution of this space? What are the most interesting, innovative, or unexpected data systems that you have encountered? What are the most interesting, unexpected, or challenging lessons that you have learned while working on building and supporting data systems? What are the areas that you are paying the most attention to? What interesting predictions do you have for the future of data systems and their applications?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.

Links
SoFi, BigQuery, Dremel, Brigham Young University, Empirical Software Engineering, Map/Reduce, Hadoop, Sawzall, VLDB Test Of Time Award Paper, GFS, Colossus, Partitioned Hash Join, Google BigTable, HBase, AWS Athena, Snowflake (Podcast Episode), Data Vault, Star Schema, Privacy Vault, Homomorphic Encryption

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
May 9, 2022 • 40min

Scaling Analysis of Connected Data And Modeling Complex Relationships With The TigerGraph Graph Database

Summary Many of the events, ideas, and objects that we try to represent through data have a high degree of connectivity in the real world. These connections are best represented and analyzed as graphs to provide efficient and accurate analysis of their relationships. TigerGraph is a leading database that offers a highly scalable and performant native graph engine for powering graph analytics and machine learning. In this episode Jon Herke shares how TigerGraph customers are taking advantage of those capabilities to achieve meaningful discoveries in their fields, the utilities that it provides for modeling and managing your connected data, and some of his own experiences working with the platform before joining the company. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. 
Signup for the SaaS product at dataengineeringpodcast.com/acryl RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit http://www.dataengineeringpodcast.com/montecarlo?utm_source=rss&utm_medium=rss to learn more. Your host is Tobias Macey and today I’m interviewing Jon Herke about TigerGraph, a distributed native graph database Interview Introduction How did you get involved in the area of data management? Can you describe what TigerGraph is and the story behind it? What are some of the core use cases that you are focused on supporting? How has TigerGraph changed over the past 4 years since I spoke with Todd Blaschka at the Open Data Science Conference? 
How has the ecosystem of graph databases changed in usage and design in recent years? What are some of the persistent areas of confusion or misinformation that you encounter when explaining graph databases and TigerGraph to potential users? The tagline on your website says that TigerGraph is "The Only Scalable Graph Database for the Enterprise". Can you unpack that claim and explain what is necessary for a graph database to be suitable for enterprise use? What are some of the application and system architectures that you typically see for end-users of TigerGraph? (e.g. polyglot persistence, etc.) What are the cases where TigerGraph should be the system of record as opposed to an optimization option for addressing highly connected data? What are the data modeling considerations that end-users should be thinking of when planning their storage structures in TigerGraph? What are the most interesting, innovative, or unexpected ways that you have seen TigerGraph used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on TigerGraph? When is TigerGraph the wrong choice? What do you have planned for the future of TigerGraph?

Contact Info
LinkedIn, @jonherke on Twitter

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.

Links
TigerGraph, GraphQL, Kafka, GQL (Graph Query Language), LDBC (Linked Data Benchmark Council)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
May 2, 2022 • 53min

Leading The Charge For The ELT Data Integration Pattern For Cloud Data Warehouses At Matillion

Summary The predominant pattern for data integration in the cloud has become extract, load, and then transform or ELT. Matillion was an early innovator of that approach and in this episode CTO Ed Thompson explains how they have evolved the platform to keep pace with the rapidly changing ecosystem. He describes how the platform is architected, the challenges related to selling cloud technologies into enterprise organizations, and how you can adopt Matillion for your own workflows to reduce the maintenance burden of data integration workflows. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. 
If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription. Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying; you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit http://www.dataengineeringpodcast.com/montecarlo?utm_source=rss&utm_medium=rss to learn more. 
Your host is Tobias Macey and today I’m interviewing Ed Thompson about Matillion, a cloud-native data integration platform for accelerating your time to analytics Interview Introduction How did you get involved in the area of data management? Can you describe what Matillion is and the story behind it? What are the use cases and user personas that you are focused on supporting? How does that influence the focus and pace of your feature development and priorities? How is Matillion architected? How have the design and goals of the system changed since you started working on it? The ecosystems of both cloud technologies and data processing have been rapidly growing and evolving, with new patterns and paradigms being introduced. What are the elements of your product focus and messaging that you have had to update and what are the core principles that have stayed the same? What have been the most challenging integrations to build and support? What is a typical workflow for integrating Matillion into an organization and building a set of pipelines? What are some of the patterns that have been useful for managing incidental complexity as usage scales? What are the most interesting, innovative, or unexpected ways that you have seen Matillion used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Matillion? When is Matillion the wrong choice? What do you have planned for the future of Matillion? Contact Info LinkedIn Matillion Contact Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. 
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and co-workers. Links: Matillion Twitter IBM DB2 Cognos Talend Redshift AWS Marketplace AWS Re:Invent Azure GCP == Google Cloud Platform Informatica SSIS == SQL Server Integration Services PCRE == Perl Compatible Regular Expressions Teradata Tomcat Collibra Alation. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
May 2, 2022 • 1h 4min

Evolving And Scaling The Data Platform at Yotpo

Summary Building a data platform is an iterative and evolutionary process that requires collaboration with internal stakeholders to ensure that their needs are being met. Yotpo has been on a journey to evolve and scale their data platform to continue serving the needs of their organization as it increases the scale and sophistication of data usage. In this episode Doron Porat and Liran Yogev explain how they arrived at their current architecture, the capabilities that they are optimizing for, and the complex process of identifying and evaluating new components to integrate into their systems. This is an excellent exploration of the decisions and tradeoffs that need to be made while building such a complex system. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. 
Signup for the SaaS product at dataengineeringpodcast.com/acryl RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog Your host is Tobias Macey and today I’m interviewing Doron Porat and Liran Yogev about their experiences designing and implementing a self-serve data platform at Yotpo Interview Introduction How did you get involved in the area of data management? Can you describe what Yotpo is and the role that data plays in the organization? What are the core data types and sources that you are working with? What kinds of data assets are being produced and how do those get consumed and re-integrated into the business? What are the user personas that you are supporting and what are the interfaces that they are comfortable interacting with? What is the size of your team and how is it structured? You recently posted about the current architecture of your data platform. 
What was the starting point on your platform journey? What did the early stages of feature and platform evolution look like? What was the catalyst for making a concerted effort to integrate your systems into a cohesive platform? What was the scope and directive of the project for building a platform? What are the metrics and capabilities that you are optimizing for in the structure of your data platform? What are the organizational or regulatory constraints that you needed to account for? What are some of the early decisions that affected your available choices in later stages of the project? What does the current state of your architecture look like? How long did it take to get to where you are today? What were the factors that you considered in the various build vs. buy decisions? How did you manage cost modeling to understand the true savings on either side of that decision? If you were to start from scratch on a new data platform today what might you do differently? What are the decisions that proved helpful in the later stages of your platform development? What are the most interesting, innovative, or unexpected ways that you have seen your platform used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on designing and implementing your platform? What do you have planned for the future of your platform infrastructure? Contact Info Doron LinkedIn Liran LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and co-workers. Links: Yotpo Data Platform Architecture Blog Post Greenplum Databricks Metorikku Apache Hive CDC == Change Data Capture Debezium Podcast Episode Apache Hudi Podcast Episode Upsolver Podcast Episode Spark PrestoDB Snowflake Podcast Episode Druid Rockset Podcast Episode dbt Podcast Episode Acryl Podcast Episode Atlan Podcast Episode OpenLineage Podcast Episode Okera Shopify Data Warehouse Episode Redshift Delta Lake Podcast Episode Iceberg Podcast Episode Outbox Pattern Backstage Roadie Nomad Kubernetes Deequ Great Expectations Podcast Episode LakeFS Podcast Episode 2021 Recap Episode Monte Carlo. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Apr 24, 2022 • 1h 11min

Operational Analytics At Speed With Minimal Busy Work Using Incorta

Summary A huge amount of effort goes into modeling and shaping data to make it available for analytical purposes. This is often due to the need to simplify the final queries so that they are performant for visualization or limited exploration. In order to cut down the level of effort involved in making data usable, Matthew Halliday and his co-founders created Incorta as an end-to-end, in-memory analytical engine that removes barriers to insights on your data. In this episode he explains how the system works, the use cases that it empowers, and how you can start using it for your own analytics today. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. 
Your host is Tobias Macey and today I’m interviewing Matthew Halliday about Incorta, an in-memory, unified data and analytics platform as a service Interview Introduction How did you get involved in the area of data management? Can you describe what Incorta is and the story behind it? What are the use cases and customers that you are focused on? How does that focus inform the design and priorities of functionality in the product? What are the technologies and workflows that Incorta might replace? What are the systems and services that it is intended to integrate with and extend? Can you describe how Incorta is implemented? What are the core technological decisions that were necessary to make the product successful? How have the design and goals of the system changed and evolved since you started working on it? Can you describe the workflow for building an end-to-end analysis using Incorta? What are some of the new capabilities or use cases that Incorta enables which are impractical or intractable with other combinations of tools in the ecosystem? How do the features of Incorta influence the approach that teams take for data modeling? What are the points of collaboration and overlap between organizational roles while using Incorta? What are the most interesting, innovative, or unexpected ways that you have seen Incorta used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Incorta? When is Incorta the wrong choice? What do you have planned for the future of Incorta? Contact Info LinkedIn @layereddelay on Twitter Website Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. 
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and co-workers. Links: Incorta 3rd Normal Form Parquet Podcast Episode Delta Lake Podcast Episode Iceberg Podcast Episode PrestoDB PySpark Dataiku Angular React Apache ECharts. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Apr 24, 2022 • 59min

Gain Visibility Into Your Entire Machine Learning System Using Data Logging With WhyLogs

Summary There are very few tools which are equally useful for data engineers, data scientists, and machine learning engineers. WhyLogs is a powerful library for flexibly instrumenting all of your data systems to understand the entire lifecycle of your data from source to productionized model. In this episode Andy Dang explains why the project was created, how you can apply it to your existing data systems, and how it functions to provide detailed context for being able to gain insight into all of your data processes. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl RudderStack helps you build a customer data platform on your warehouse or data lake. 
Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog Your host is Tobias Macey and today I’m interviewing Andy Dang about powering observability of AI systems with the whylogs data logging library Interview Introduction How did you get involved in the area of data management? Can you describe what Whylabs is and the story behind it? How is "data logging" differentiated from logging for the purpose of debugging and observability of software logic? What are the use cases that you are aiming to support with Whylogs? How does it compare to libraries and services like Great Expectations/Monte Carlo/Soda Data/Datafold etc. Can you describe how Whylogs is implemented? How have the design and goals of the project changed or evolved since you started working on it? How do you maintain feature parity between the Python and Java integrations? How do you structure the log events and metadata to provide detail and context for data applications? 
How does that structure support aggregation and interpretation/analysis of the log information? What is the process for integrating Whylogs into an existing project? Once you have the code instrumented with log events, what is the workflow for using Whylogs to debug and maintain a data application? What have you found to be useful heuristics for identifying what to log? What are some of the strategies that teams can use to maintain a balance of signal vs. noise in the events that they are logging? How is the Whylogs governance set up and how are you approaching sustainability of the open source project? What are the additional utilities and services that you anticipate layering on top of/integrating with Whylogs? What are the most interesting, innovative, or unexpected ways that you have seen Whylogs used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Whylabs? When is Whylogs/Whylabs the wrong choice? What do you have planned for the future of Whylabs? Contact Info LinkedIn @andy_dng on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story. 
To help other people find the show please leave a review on iTunes and tell your friends and co-workers Links Whylogs Whylabs Spark Airflow Pandas Podcast Episode Data Sketches Grafana Great Expectations Podcast Episode Monte Carlo Podcast Episode Soda Data Podcast Episode Datafold Podcast Episode Delta Lake Podcast Episode HyperLogLog MLFlow Flyte The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
