Data Engineering Podcast

Tobias Macey
Oct 9, 2023 • 52min

Using Data To Illuminate The Intentionally Opaque Insurance Industry

Max Cho, founder of a business that makes insurance policy selection more navigable, discusses the challenges of navigating the intentionally opaque insurance industry. Topics include data collection and analysis, automating a manual industry, insurance pricing transparency, the challenges of AI navigation, data preprocessing for analysis, understanding policy complexities, and the utility of large language models in the insurance industry.
Oct 1, 2023 • 52min

Building ETL Pipelines With Generative AI

Topics include AI's impact on ETL processes, using generative AI to handle unstructured data, experimenting with AI models, the evolving role of AI assistants in data engineering, the considerations and challenges of using AI in ETL pipelines, and the changing landscape of ETL tools.
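As a concrete illustration of the pattern discussed in this episode, below is a minimal sketch of how a generative model can slot into the transform step of an ETL pipeline to turn unstructured text into structured rows. The `complete` function and the invoice fields are hypothetical placeholders, not any specific model's API.

```python
import json

def complete(prompt: str) -> str:
    # Hypothetical stand-in for an LLM client call; returns a canned
    # response here so the sketch runs end to end.
    return '{"vendor": "Acme Corp", "invoice_date": "2023-09-01", "total": 1250.0}'

def extract_fields(raw_text: str) -> dict:
    """Transform step: ask the model to emit structured JSON, then validate."""
    prompt = (
        "Extract the vendor, invoice_date, and total from the text below. "
        "Respond with a single JSON object and nothing else.\n\n" + raw_text
    )
    record = json.loads(complete(prompt))
    # LLM output is probabilistic, so check the schema before loading downstream.
    missing = {"vendor", "invoice_date", "total"} - record.keys()
    if missing:
        raise ValueError(f"model omitted required fields: {missing}")
    return record

def run_etl(documents: list[str]) -> list[dict]:
    """Extract -> transform (via the model) -> rows ready to load."""
    return [extract_fields(doc) for doc in documents]

print(run_etl(["Invoice from Acme Corp dated 2023-09-01 for $1,250.00"]))
```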
Sep 25, 2023 • 59min

Powering Vector Search With Real Time And Incremental Vector Indexes

This episode discusses the growth of machine learning and the resulting need for vector search capabilities. It explores the challenges of maintaining real-time indexes, the benefits of semantic search, and how to incorporate vector search into data flows, and covers the considerations and limitations of vector search along with insights on working with vector databases.
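To ground the terminology, here is a minimal brute-force sketch of semantic search in Python with NumPy: documents and queries are embedded as vectors, and search becomes a nearest-neighbor lookup by cosine similarity. Production vector databases replace this linear scan with approximate indexes that, per the episode, must be kept fresh in real time, but the interface is the same. The toy embeddings are made up for illustration.

```python
import numpy as np

def cosine_top_k(query: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k vectors in `index` most similar to `query`."""
    # Normalize so the dot product equals cosine similarity.
    index_norm = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = index_norm @ query_norm
    return np.argsort(scores)[::-1][:k]

# Toy 4-dimensional "embeddings"; real ones come from an embedding model.
docs = np.array([
    [0.9, 0.1, 0.0, 0.0],   # doc 0
    [0.0, 0.8, 0.2, 0.0],   # doc 1
    [0.1, 0.0, 0.9, 0.1],   # doc 2
])
query = np.array([1.0, 0.2, 0.0, 0.0])
print(cosine_top_k(query, docs, k=2))  # doc 0 first, then doc 1
```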
Sep 17, 2023 • 1h 2min

Building Linked Data Products With JSON-LD

In this episode, Brian Platz discusses the concept and implications of linked data, the benefits of using JSON-LD to build semantic data products, the challenges faced in building linked data products, and the need for improved data management tools.
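For readers new to the format, here is a small sketch of a JSON-LD document built in Python. The `@context` maps local keys to shared vocabulary terms (schema.org in this sketch), which is what lets independently produced records link together; the identifiers are hypothetical.

```python
import json

# A minimal JSON-LD document: `@context` maps each key to a shared
# vocabulary (schema.org), turning local field names into linked data.
doc = {
    "@context": {
        "name": "https://schema.org/name",
        "knows": "https://schema.org/knows",
    },
    "@id": "https://example.com/people/ada",   # hypothetical identifier
    "name": "Ada Lovelace",
    "knows": {"@id": "https://example.com/people/charles"},
}

print(json.dumps(doc, indent=2))
```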
Sep 10, 2023 • 1h 1min

An Overview Of The State Of Data Orchestration In An Increasingly Complex Data Ecosystem

Nick Schrock, creator of Dagster, discusses the state of data orchestration technology and its applications. The conversation explores the challenges and benefits of orchestrators, the balance between information and infrastructure, low-code and no-code solutions in data work and their integration into software engineering, and the role of data orchestration in ML workflows.
Sep 4, 2023 • 42min

Eliminate The Overhead In Your Data Integration With The Open Source dlt Library

This episode explores the dlt project, an open source Python library for data loading. It covers the challenges of data integration, the benefits of dlt over other tools, and how to start building pipelines. Other topics include the journey of becoming a data engineer, performance considerations when using Python, collaboration in data integration, and integration with different runtimes, along with the need for better education in data management and practical solutions.
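As a flavor of what getting started looks like, here is a minimal dlt pipeline sketch following the pattern in dlt's public documentation: declare a resource, point a pipeline at a destination, and run it. The resource name and records are illustrative, and exact options may vary by version.

```python
import dlt

@dlt.resource(name="users", write_disposition="append")
def users():
    # Illustrative records; in practice this would yield rows from an API or file.
    yield {"id": 1, "name": "ada"}
    yield {"id": 2, "name": "grace"}

# dlt infers the schema, normalizes nested data, and handles loading.
pipeline = dlt.pipeline(
    pipeline_name="demo",
    destination="duckdb",      # local destination; swap for your warehouse
    dataset_name="raw_users",
)

load_info = pipeline.run(users())
print(load_info)
```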
Aug 28, 2023 • 1h 1min

Building An Internal Database As A Service Platform At Cloudflare

This episode explores how Cloudflare provides PostgreSQL as a service to its developers for low latency, high uptime services at global scale. Topics include the challenges of maintaining high uptime and managing data volume, scaling considerations and load balancing strategies, the evolution of database engines, differences in version upgrades between Postgres and MySQL, innovative usage and challenges in building a database platform at Cloudflare, and lessons learned in building the system.
Aug 20, 2023 • 55min

Harnessing Generative AI For Creating Educational Content With Illumidesk

Topics include generative AI in educational content creation, building a data-driven experience for learners, the challenges of dealing with large amounts of data, analyzing learner interactions to improve content development, data normalization and personalized learning paths, the implementation and architecture of the Illumidesk platform, the platform's evolution and the incorporation of an LLM framework into its data engineering pipeline, and the application of Illumidesk for content creation.
Aug 14, 2023 • 47min

Unpacking The Seven Principles Of Modern Data Pipelines

Summary

Data pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful, you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your data. In this episode Ariel Pohoryles explains what they are and how they work together to increase your chances of success.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by defining what you mean by a "modern" data pipeline?
At Rivery you published a white paper identifying seven principles of modern data pipelines:
- Zero infrastructure management
- ELT-first mindset
- Speaks SQL and Python
- Dynamic multi-storage layers
- Reverse ETL & operational analytics
- Full transparency
- Faster time to value
What are the applications of data that you focused on while identifying these principles?
How does the application of these principles influence the ability of organizations and their data teams to encourage and keep pace with the use of data in the business?
What are the technical components of a pipeline infrastructure that are necessary to support a "modern" workflow?
How do the technologies involved impact the organizational involvement with how data is applied throughout the business?
When using managed services, what are the ways that the pricing model acts to encourage/discourage experimentation/exploration with data?
What are the most interesting, innovative, or unexpected ways that you have seen these seven principles implemented/applied?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with customers to adapt to these principles?
What are the cases where some/all of these principles are undesirable/impractical to implement?
What are the opportunities for further advancement/sophistication in the ways that teams work with and gain value from data?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Rivery
7 Principles Of The Modern Data Pipeline
ELT
Reverse ETL
Martech Landscape
Data Lakehouse
Databricks
Snowflake

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
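Of the seven principles, "Reverse ETL & operational analytics" is the least self-describing, so here is a minimal sketch of the pattern: read modeled data back out of the warehouse and push it to an operational tool, the reverse of the usual source-to-warehouse flow. SQLite stands in for the warehouse client, and the CRM endpoint, table, and columns are hypothetical.

```python
import json
import sqlite3               # stands in for a warehouse client (Snowflake, BigQuery, ...)
import urllib.request

def sync_to_crm(warehouse: sqlite3.Connection, crm_url: str) -> None:
    """Reverse ETL: push modeled warehouse rows out to an operational tool."""
    rows = warehouse.execute(
        "SELECT customer_id, lifetime_value FROM customer_profiles"  # hypothetical model
    )
    for customer_id, ltv in rows:
        payload = json.dumps({"id": customer_id, "ltv": ltv}).encode()
        req = urllib.request.Request(
            crm_url,  # hypothetical CRM ingest endpoint
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```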
Aug 6, 2023 • 1h 2min

Quantifying The Return On Investment For Your Data Team

This episode explores how to calculate the ROI of data teams, covering methods of measuring ROI, collecting and analyzing data for efficiency, query optimization, generative AI, innovative approaches to ROI, and the biggest gaps in data management tooling.
