DuckDB, Apache Arrow, & the Future of Data Engineering w/ Rusty Conover | S2E3

Sep 9, 2025

Rusty Conover, a data engineering ace and prolific DuckDB extension creator, delves into the transformative power of DuckDB, emphasizing its speed and simplicity. He explains how the in-process architecture challenges traditional big data systems and explores the synergy with Apache Arrow. Rusty also shares insights on his 15 extensions, including Airport for data integration, and discusses the future of open table formats like Iceberg and Delta Lake. The conversation reveals DuckDB's potential to revolutionize analytics and replace complex ETL processes.

ADVICE

Use Data Sketches For Large-Scale Approximation

Use data sketches for approximate analytics when exactness isn't required and memory is limited.
Sketches give distributions, heavy hitters, and quantiles cheaply across billions of rows.
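As a rough illustration of the heavy-hitters case, here is a minimal sketch using the Misra-Gries summary. The episode notes do not name a specific algorithm, so this choice, the function name `misra_gries`, and the toy ticker stream are all assumptions made for the example. It keeps at most k - 1 counters regardless of stream length, and any item occurring more than n/k times in a stream of n items is guaranteed to survive in the summary.

```python
def misra_gries(stream, k):
    """Summarize a stream with at most k - 1 counters (Misra-Gries).

    Hypothetical helper for illustration; not taken from the episode.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping any that hit zero.
            # The current item is discarded in this step.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Toy stream of 1,000 items: two frequent tickers plus 100 one-off symbols.
stream = ["AAPL"] * 600 + ["MSFT"] * 300 + [f"T{i}" for i in range(100)]
summary = misra_gries(stream, k=10)

# Counts are underestimates, off by at most n / k = 100 here.
for symbol in ("AAPL", "MSFT"):
    print(symbol, summary[symbol])
```

The memory bound depends only on k, not on the number of rows, which is the property that makes sketches attractive at billions of rows. Quantile and distribution sketches follow the same idea with different internal structures.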

INSIGHT

Arrow As A Columnar Interchange

Arrow is an in-memory columnar format organized into record batches; its libraries can read and write Parquet, ORC, CSV, and the Arrow IPC format.
Arrow acts as a lingua franca between compute engines and avoids repeated copies and format conversions.
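The copy-avoidance idea can be shown without Arrow itself. The following is a standard-library-only sketch, not the Arrow API: it stores each column in one contiguous typed buffer and hands out a view of that buffer instead of a copy. The sample rows and variable names are assumptions made for the example.

```python
from array import array

# Row-oriented input: (trade id, price) tuples.
rows = [(1, 101.5), (2, 99.25), (3, 100.0), (4, 102.75)]

# Columnar layout: one contiguous typed buffer per column.
ids = array("q", (r[0] for r in rows))       # 64-bit signed integers
prices = array("d", (r[1] for r in rows))    # 64-bit floats

# A memoryview exposes the same bytes without copying them.
view = memoryview(prices)
tail = view[2:]  # slice of the last two prices, still no copy

# Writing through the original buffer is visible through the view,
# which shows that both names refer to the same memory.
prices[3] = 0.0
print(tail.tolist())
```

A consumer that understands the buffer layout can read `tail` directly, which is the same principle that lets engines exchange Arrow record batches without serializing and re-parsing them.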

INSIGHT

Port Queries Between Engines

Different columnar engines (ClickHouse, DuckDB) have trade-offs; SQL translation tools like SQLGlot ease portability.
Consider porting queries to a different engine when another backend offers better performance or lower cost.

In this episode of The Hedgineer Podcast, host Michael Watson is joined by special guest Rusty Conover, the world's most prolific DuckDB extension builder, for a masterclass on building the next generation of real-time, large-scale data systems.

Rusty, who has an extensive career in data engineering, including at multi-manager hedge funds, pulls back the curtain on what makes DuckDB so revolutionary for developers and data engineers. They explore how its blazingly fast, in-process, C++-based architecture is challenging the big data status quo. The conversation provides a deep dive into the powerful ecosystem growing around DuckDB, from the Apache Arrow columnar format to the evolving landscape of open table formats like Iceberg, Delta Lake, and the new DuckLake.

Join them for a detailed discussion on the nitty-gritty of modern data infrastructure, whether you're building enterprise data platforms or looking for the most efficient tools for your analytics workload.

In this episode, you will learn about:

The DuckDB Revolution: What makes this "blazingly fast" in-process database a game-changer that can simplify and replace entire ETL stacks.

A Tour of DuckDB Extensions: A look inside some of the 15 extensions Rusty has built, from Airport for integrating with Apache Arrow, to Crypto, ShellFS, and TextPlot.

Diving into Apache Arrow: An explanation of the columnar in-memory data format, zero-copy operations, and the Arrow Flight RPC mechanism for efficiently moving data.

The Battle of Open Table Formats: A comparison of Iceberg, Delta Lake, and the new database-centric approach of DuckLake.

DuckDB vs. The World: How DuckDB stacks up against KDB for financial data, ClickHouse for analytics, and its role alongside large-scale compute engines like Apache Spark.

Parquet Deep Dive: The key differences between Parquet V1 and V2 and the importance of modern compression strategies and encodings.

The Future of DuckDB: A sneak peek at powerful upcoming features like time travel and the MERGE INTO statement for simplifying change data capture (CDC) pipelines.

Hosted by Michael Watson, The Hedgineer Podcast dives into AI technology and data in the hedge fund, asset management, and prop trading space.

Follow The Hedgineer Podcast:

YouTube: https://www.youtube.com/@hedgineer

LinkedIn: https://www.linkedin.com/company/90976838

Twitter: https://x.com/hedgineering

Instagram: https://www.instagram.com/hedgineer/

Hedgineer.io
