
Data Engineering Podcast
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Latest episodes

May 21, 2025 • 50min
From Data Discovery to AI: The Evolution of Semantic Layers
Shinji Kim, Founder and CEO of Select Star, shares insights on the evolving role of semantic layers in AI. She discusses the journey from statistical analysis to data governance, highlighting challenges enterprises face with data access. The conversation covers the shift from centralized to decentralized data teams and the importance of metadata management. Shinji emphasizes the critical role of semantic modeling for business intelligence and how AI can enhance data accuracy. She also explores the future of semantic modeling in data warehouses, addressing operationalization challenges.
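
To make the idea concrete, here is a toy Python sketch of what a semantic layer provides: one governed definition of a business metric that every consumer compiles to SQL, rather than each tool re-deriving it. All names here are illustrative, not Select Star's implementation.

METRICS = {
    "weekly_active_users": {
        "description": "Distinct users with any event in the last 7 days",
        "sql": "SELECT COUNT(DISTINCT user_id) FROM events "
               "WHERE event_ts >= NOW() - INTERVAL 7 DAY",
    },
}

def compile_metric(name: str) -> str:
    # Every consumer (dashboard, notebook, AI agent) resolves the metric
    # through this one definition, so the numbers agree everywhere.
    return METRICS[name]["sql"]

print(compile_metric("weekly_active_users"))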

May 13, 2025 • 46min
Balancing Off-the-Shelf and Custom Solutions in Data Engineering
Tulika Bhatt, a senior software engineer at Netflix specializing in impression data, shares her journey from BlackRock and Verizon to shaping data services at a top streaming service. She discusses the challenges of balancing off-the-shelf solutions with custom systems, utilizing technologies like Spark and Flink. Tulika dives into the intricacies of ensuring data quality and observability, emphasizing automation and robust alerting strategies. She also explores the integration of AI in data engineering, highlighting its potential and the hurdles faced in maximizing efficiency.
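
As a hedged illustration of the kind of automated quality check and alerting she describes, here is a minimal Python sketch; the field names and threshold are assumptions, not Netflix's actual pipeline.

def null_rate(rows, field):
    return sum(1 for r in rows if r.get(field) is None) / max(len(rows), 1)

def check_and_alert(rows, field, threshold=0.01):
    rate = null_rate(rows, field)
    if rate > threshold:
        # In a real pipeline this would page on-call or post to a channel.
        print(f"ALERT: {field} null rate {rate:.2%} exceeds {threshold:.0%}")
    return rate

rows = [{"impression_id": "a1"}, {"impression_id": None}]
check_and_alert(rows, "impression_id")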

May 5, 2025 • 60min
StarRocks: Bridging Lakehouse and OLAP for High-Performance Analytics
Sida Shen, a product manager at CelerData and a contributor to StarRocks, dives into the innovative world of high-performance analytical databases. He shares the origins of StarRocks, illustrating its evolution from Apache Doris into a robust lakehouse query engine. Topics include handling high concurrency and low latency queries, bridging traditional OLAP with lakehouse architecture, and the importance of integration with formats like Apache Iceberg. Sida also emphasizes the challenges of denormalization and real-time data processing in modern analytics.
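
One practical detail worth knowing: StarRocks speaks the MySQL wire protocol, so a stock MySQL client can query it directly. A minimal Python sketch with pymysql follows; the host, user, database, and table are assumptions (9030 is the usual frontend query port).

import pymysql

conn = pymysql.connect(host="localhost", port=9030, user="root", database="demo")
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM orders")  # hypothetical table
    print(cur.fetchone())
conn.close()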

Apr 28, 2025 • 1h 13min
Exploring NATS: A Multi-Paradigm Connectivity Layer for Distributed Applications
Derek Collison, the creator of NATS and CEO of Synadia, shares insights from his impressive background at Google and VMware. He discusses how NATS revolutionizes messaging systems with innovative features like the circuit breaker pattern and JetStream. Derek highlights NATS's advantages in edge computing, emphasizing its resilience and data persistence capabilities. He also addresses the challenges of open-source technology and shares thoughts on the future of connectivity in modern distributed systems, demonstrating NATS's versatility across various industries.
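
For readers new to NATS, a minimal publish/subscribe sketch using the nats-py client is below; the server URL and subject name are assumptions for illustration.

import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")  # assumed local server

    async def handler(msg):
        print(f"received on {msg.subject}: {msg.data.decode()}")

    await nc.subscribe("events.demo", cb=handler)
    await nc.publish("events.demo", b"hello")
    await nc.flush()
    await asyncio.sleep(0.1)  # let the handler fire before closing
    await nc.close()

asyncio.run(main())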

Apr 21, 2025 • 57min
Advanced Lakehouse Management With The LakeKeeper Iceberg REST Catalog
Victor Kessler, co-founder of Vakamo and developer of Lakekeeper, dives into the world of advanced lakehouse management with a focus on Apache Iceberg. He discusses the pivotal role of metadata in data actionability and the evolution of data catalogs. Victor highlights innovative features of Lakekeeper, like integration with OpenFGA for access control and its deployment using Rust on Kubernetes. He also addresses the challenges of migrating data catalogs and the importance of community involvement in open-source projects for better data management.
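
Because Lakekeeper implements the standard Iceberg REST catalog API, a generic client such as pyiceberg can talk to it. A minimal sketch, with the endpoint, warehouse, and table identifier as assumptions:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakekeeper",  # local alias for this catalog
    **{
        "type": "rest",
        "uri": "http://localhost:8181/catalog",  # assumed Lakekeeper endpoint
        "warehouse": "demo",                     # assumed warehouse name
    },
)

table = catalog.load_table("analytics.events")   # assumed namespace.table
print(table.schema())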

Apr 12, 2025 • 40min
Simplifying Data Pipelines with Durable Execution
In this engaging conversation, Jeremy Edberg, CEO of DBOS and former tech lead at companies like Reddit and Netflix, discusses the vital concept of durable execution in data systems. He reveals how DBOS's serverless platform enhances local resilience and simplifies the intricacies of data pipelines. Jeremy emphasizes the significance of version management for long-running workflows and introduces the Transact library, which boosts reliability and efficiency. This episode is a treasure trove for anyone interested in optimizing data workflows and reducing operational headaches.
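
To ground the term, here is a toy Python sketch of the durable-execution idea: each step's result is checkpointed, so a restarted workflow resumes where it left off instead of re-running completed work. This is a generic illustration, not the DBOS Transact API.

import json, os

CHECKPOINT = "workflow_state.json"

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {}

def durable_step(name, fn, state):
    if name in state:          # step already completed before a crash/restart
        return state[name]
    result = fn()
    state[name] = result
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)    # persist before moving to the next step
    return result

state = load_state()
raw = durable_step("extract", lambda: [1, 2, 3], state)
clean = durable_step("transform", lambda: [x * 2 for x in raw], state)
durable_step("load", lambda: sum(clean), state)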

Mar 30, 2025 • 44min
Overcoming Redis Limitations: The Dragonfly DB Approach
Roman Gershman, CTO and founder of Dragonfly DB, shares his journey from Google to creating a high-speed alternative to Redis. He dives into the challenges of developing in-memory databases, focusing on performance, scalability, and cost efficiency. Roman discusses operational complexities users face, while highlighting Dragonfly's compatibility with Redis and innovations like SSD tiering. He also explores programming trade-offs between C++ and Rust, emphasizing adaptability in database development and the importance of community feedback in shaping future advancements.
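
That Redis compatibility is wire-level, so the standard redis-py client works unchanged against a Dragonfly endpoint; only the host and port below are assumptions.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.set("page:views", 0)
r.incr("page:views")
print(r.get("page:views"))  # -> "1"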

Mar 24, 2025 • 53min
Bringing AI Into The Inner Loop of Data Engineering With Ascend
Sean Knapp, Founder and CEO of Ascend.io, shares his expertise in data engineering and AI's transformative role. He discusses how AI can streamline workflows, alleviate burdens for data engineers, and enhance productivity by automating tasks. Sean highlights challenges like data governance and the integration of AI into existing systems. The conversation also touches on bridging the gap between junior and senior engineers using AI as a collaborative tool, as well as the future potential of AI to revolutionize data engineering processes.

Mar 16, 2025 • 52min
Astronomer's Role in the Airflow Ecosystem: A Deep Dive with Pete DeJoy
Pete DeJoy, co-founder and product lead at Astronomer, shares his extensive experience with Airflow, discussing its evolution and upcoming enhancements in Airflow 3. He highlights Astronomer's commitment to improving data operations and community involvement. The conversation dives into the critical role of data observability through Astro Observe, innovative use cases like the Texas Rangers in-game analytics, and the shifting landscape of data engineering roles, emphasizing collaboration and advanced tooling in the modern data ecosystem.
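
For context, this is what a minimal Airflow pipeline looks like with the TaskFlow API; the DAG and task names are illustrative only.

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
def example_pipeline():
    @task
    def extract():
        return [1, 2, 3]

    @task
    def load(data):
        print(sum(data))

    load(extract())

example_pipeline()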

Mar 8, 2025 • 56min
Accelerated Computing in Modern Data Centers With Datapelago
Summary
In this episode of the Data Engineering Podcast, Rajan Goyal, CEO and co-founder of Datapelago, talks about improving efficiencies in data processing by reimagining system architecture. Rajan explains the shift from hyperconverged to disaggregated and composable infrastructure, highlighting the importance of accelerated computing in modern data centers. He discusses the evolution from proprietary to open, composable stacks, emphasizing the role of open table formats and the need for a universal data processing engine, and outlines Datapelago's strategy to leverage existing frameworks like Spark and Trino while providing accelerated computing benefits.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Rajan Goyal about how to drastically improve efficiencies in data processing by re-imagining the system architecture

Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the main factors that contribute to performance challenges in data lake environments?
The different components of open data processing systems have evolved from different starting points with different objectives. In your experience, how has that un-planned and un-synchronized evolution of the ecosystem hindered the capabilities and adoption of open technologies?
The introduction of a new cross-cutting capability (e.g. Iceberg) has typically taken a substantial amount of time to gain support across different engines and ecosystems. What do you see as the point of highest leverage to improve the capabilities of the entire stack with the least amount of co-ordination?
What was the motivating insight that led you to invest in the technology that powers Datapelago?
Can you describe the system design of Datapelago and how it integrates with existing data engines?
The growth in the generation and application of unstructured data is a notable shift in the work being done by data teams. What are the areas of overlap in the fundamental nature of data (whether structured, semi-structured, or unstructured) that you are able to exploit to bridge the processing gap?
What are the most interesting, innovative, or unexpected ways that you have seen Datapelago used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datapelago?
When is Datapelago the wrong choice?
What do you have planned for the future of Datapelago?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Datapelago
MIPS Architecture
ARM Architecture
AWS Nitro
Mellanox
Nvidia
Von Neumann Architecture
TPU == Tensor Processing Unit
FPGA == Field-Programmable Gate Array
Spark
Trino
Iceberg (Podcast Episode)
Delta Lake (Podcast Episode)
Hudi (Podcast Episode)
Apache Gluten
Intermediate Representation
Turing Completeness
LLVM
Amdahl's Law
LSTM == Long Short-Term Memory

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA