#034 Rethinking Search Inside Postgres, From Lexemes to BM25

Dec 5, 2024

Philippe Noël, Founder and CEO of ParadeDB, dives into the revolutionary shift in search technology with his open-source PostgreSQL extension. He discusses how ParadeDB eliminates the need for separate search clusters by enabling search directly within databases, simplifying architecture and enhancing cost-efficiency. The conversation explores BM25 indexing, maintaining data normalization, and the advantages of ACID compliance with search. Philippe also reveals successful use cases, including Alibaba Cloud’s implementation, and practical insights for optimizing large-scale search applications.

INSIGHT

ParadeDB's Sweet Spot

ParadeDB excels with structured relational data in Postgres, offering strong data integrity.
For large JSON document workloads, a NoSQL search engine might be more suitable.

INSIGHT

Composable Data Systems and Their Challenges

Integrating multiple query engines like DuckDB or DataFusion within Postgres creates overhead, hindering data integrity.
Building features natively in Postgres, while more work, ensures better performance and transactional safety.

ANECDOTE

Alibaba Cloud Case Study

Alibaba Cloud, ParadeDB's largest customer, uses it within their Postgres data warehouse.
They chose ParadeDB over Elastic to offer a unified product with full-text search capabilities.

Many companies run Elastic or OpenSearch but use only 10% of the capacity.

They have to build ETL pipelines.

Get data normalized.

Worry about race conditions.

All in all, when you want to do search on top of your transactional data today, you are forced to build a distributed system.

Not anymore.

ParadeDB is building an open-source PostgreSQL extension to enable search within your database.

Today, I am talking to Philippe Noël, the founder and CEO of ParadeDB.

We talk about how they built it, how they integrate with the Postgres query engine, and how you can build search on top of Postgres.

Key Insights:

Search is changing. We're moving from separate search clusters to search inside databases. Simpler architecture, stronger guarantees, lower costs up to a certain scale.

Most search engines force you to duplicate data. ParadeDB doesn't. You keep data normalized and join at query time. It hooks deep into Postgres's query planner. It doesn't just bolt on search; it lets Postgres optimize search queries alongside SQL ones.

Search indices can work with ACID. ParadeDB's BM25 index keeps Lucene-style components (term frequency, normalization) but adds Postgres metadata for transactions. Search + ACID is possible.
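
For readers unfamiliar with the "Lucene-style components" mentioned above, here is a minimal sketch of textbook BM25 scoring in Python. It is not ParadeDB's implementation: the function name `bm25_scores`, the toy documents, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration. It only shows how term frequency and document-length normalization combine into a score.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    Illustrative textbook BM25; k1 and b are common defaults, not
    values taken from ParadeDB.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized.
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus, already tokenized.
docs = [
    "postgres full text search".split(),
    "search search search inside postgres with bm25 ranking".split(),
    "columnar storage for analytics".split(),
]
scores = bm25_scores(["search"], docs)
```

A document that never mentions the query term scores zero, and repeated mentions raise the score with diminishing returns, which is the saturation behavior `k1` controls.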

Two storage types matter: inverted indices for text, columnar "fast fields" for analytics. Pick the right one or queries get slow. Integers now default to columnar to prevent common mistakes.
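
The difference between the two storage types can be sketched in a few lines of Python. This is a toy model under stated assumptions, not ParadeDB's on-disk format: the rows, field names, and helper functions `search` and `average_rating` are hypothetical. It shows why a term lookup suits an inverted index while an aggregate over a numeric field suits a column.

```python
from collections import defaultdict

# Hypothetical rows; field names are made up for illustration.
rows = [
    {"id": 1, "body": "postgres search extension", "rating": 5},
    {"id": 2, "body": "columnar analytics engine", "rating": 3},
    {"id": 3, "body": "search analytics in postgres", "rating": 4},
]

# Inverted index: maps each term to the ids of the rows containing it.
inverted = defaultdict(list)
for row in rows:
    for term in set(row["body"].split()):
        inverted[term].append(row["id"])

# Columnar layout: one array per field, values aligned by row position.
ratings = [row["rating"] for row in rows]


def search(term):
    """Text lookup: a single dictionary access, no scan over rows."""
    return inverted.get(term, [])


def average_rating(min_rating):
    """Analytic query: scans only the ratings column, never the text."""
    selected = [r for r in ratings if r >= min_rating]
    return sum(selected) / len(selected) if selected else None


matches = search("search")
avg = average_rating(4)
```

Answering `average_rating` from the inverted index would mean visiting every posting list, which is the kind of mismatch that makes queries slow when a field is stored in the wrong structure.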

Mixing query engines looks tempting but fails. The team tried using DuckDB and DataFusion inside Postgres. Both were fast but broke ACID compliance. They had to rebuild features natively.

Philippe Noël:

Nicolay Gerold:

00:00 Introduction to ParadeDB
00:53 Building ParadeDB with Rust
01:43 Integrating Search in Postgres
03:04 ParadeDB vs. Elastic
05:48 Technical Deep Dive: Postgres Integration
07:27 Challenges and Solutions
09:35 Transactional Safety and Performance
11:06 Composable Data Systems
15:26 Columnar Storage and Analytics
20:54 Case Study: Alibaba Cloud
21:57 Data Warehouse Context
23:24 Custom Indexing with BM25
24:01 Postgres Indexing Overview
24:17 Fast Fields and Columnar Format
24:52 Lucene Inspiration and Data Storage
26:06 Setting Up and Managing Indexes
27:43 Query Building and Complex Searches
30:21 Scaling and Sharding Strategies
35:27 Query Optimization and Common Mistakes
38:39 Future Developments and Integrations
39:24 Building a Full-Fledged Search Application
42:53 Challenges and Advantages of Using ParadeDB
46:43 Final Thoughts and Recommendations