Chunking for RAG: Stop Breaking Your Documents Into Meaningless Pieces | S2 E20
Jan 3, 2025
Brandon Smith, a research engineer at Chroma known for his extensive work on chunking techniques for retrieval-augmented generation systems, shares his insights on optimizing semantic search. He discusses the common misconceptions surrounding chunk sizes and overlap, highlighting the challenges of maintaining context in dense content. Smith criticizes existing strategies, such as OpenAI's 800-token chunks, and emphasizes the importance of coherent parsing. He also introduces innovative approaches to enhance contextual integrity in document processing, paving the way for improved information retrieval.
Achieving accuracy in semantic search is challenging due to complexities in chunking techniques that can cause significant information loss.
Traditional chunking methods often prioritize efficiency over contextual clarity, risking the loss of critical details in information retrieval processes.
Future advancements in chunking systems aim to integrate adaptive sizing and semantic coherence, enhancing the relevance and accuracy of retrieved information.
Deep dives
The Complexity of Semantic Search
Semantic search may seem simple, but achieving accuracy in a real implementation is a complex challenge. While it's straightforward to set up and operate, tuning it to produce reliable results can be extremely difficult. Specific issues arise with chunking in particular, which can cause significant information loss if not executed correctly. For example, using a single long chunk to represent dense content can omit critical details, akin to trying to compress an entire Wikipedia page into a tweet.
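To make that dilution concrete, here is a minimal sketch assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (both illustrative choices; any embedding model shows the same effect): a query about a specific fact matches a focused chunk far better than a long chunk in which the same fact is buried.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

query = "When was the first transatlantic telegraph cable completed?"
detail = "The first transatlantic telegraph cable was completed in 1858."

# Bury the same fact mid-chunk among loosely related filler sentences.
filler = "Telegraphy reshaped global communication in the nineteenth century. " * 10
long_chunk = filler + detail + " " + filler

query_emb = model.encode(query)
print(float(util.cos_sim(query_emb, model.encode(detail))))      # high similarity
print(float(util.cos_sim(query_emb, model.encode(long_chunk))))  # typically noticeably lower
```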
Challenges with Existing Chunking Methods
Existing chunking methods often rely on arbitrary lengths or fixed-token strategies, which can create more problems than they solve. These techniques may inadvertently break sentences apart or strip away the surrounding context a passage needs to be understood. Token overlap is commonly used to patch this, but it is inefficient: it duplicates data and wastes computation and storage at embedding time. Ultimately, effective retrieval requires chunking strategies that genuinely prioritize semantic coherence.
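For reference, here is what the fixed-token strategy looks like; a minimal sketch assuming the tiktoken tokenizer, with the 800-token size and 400-token overlap taken from the OpenAI example discussed in the episode.

```python
import tiktoken

def fixed_token_chunks(text: str, chunk_size: int = 800, overlap: int = 400) -> list[str]:
    """Split text into fixed-size token windows with overlap."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap  # with 400/800, every token is embedded twice
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start : start + chunk_size]))
    return chunks

# Boundaries fall wherever token 800, 1200, ... happens to land --
# mid-sentence, mid-word, or mid-table -- regardless of meaning.
```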
Efficient Chunking and Optimization Strategies
Chunking is driven primarily by efficiency and speed: embedding models have limited context windows, and re-embedding an entire corpus for every query would introduce unacceptable latency. Current methods often neglect the need for coherent context, opting instead for performance gains that compromise retrieval accuracy. The ideal approach would surface exactly the tokens a query needs, together with the passages that give them meaning, without creating excess overhead. Understanding and optimizing chunking algorithms can lead to retrieval systems that balance the need for speed with contextual relevance.
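One way to make "optimizing chunking" measurable is a recall-style score over the exact spans a query needs, in the spirit of the token-level metrics used in Chroma's chunking study. The character-offset formulation below is an illustrative simplification, not the study's exact code.

```python
def span_recall(retrieved: list[tuple[int, int]], relevant: tuple[int, int]) -> float:
    """Fraction of the relevant span covered by retrieved chunk spans.

    Spans are (start, end) character offsets into the same document.
    """
    r_start, r_end = relevant
    covered = 0
    prev_end = r_start  # advances past counted regions to avoid double-counting overlaps
    for start, end in sorted(retrieved):
        lo, hi = max(start, prev_end), min(end, r_end)
        if hi > lo:
            covered += hi - lo
            prev_end = hi
    return covered / (r_end - r_start)

# A chunker that splits the relevant passage across many chunks, only some of
# which are retrieved, scores low even if each retrieved chunk looks plausible.
```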
Innovative Solutions: Semantic and LLM-Based Chunkers
Semantic chunking seeks to create coherent segments by analyzing the similarity between smaller text segments, allowing for more contextually relevant retrieval. In contrast, LLM-based chunking leverages language models to predict optimal breakpoints in the text, ensuring that segments retain meaning. By employing advanced algorithms, such as rolling window techniques or clustering, these methods can group similar segments while maintaining a natural flow based on the document's overall structure. Such innovations significantly enhance the capability to produce self-contained chunks that align well with user queries.
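Here is a minimal sketch of the similarity-based variant, assuming sentence-transformers for embeddings: split wherever cosine similarity between neighboring sentences drops, signaling a topic shift. The regex sentence splitter, model name, and 0.4 threshold are illustrative; production chunkers smooth similarities over a rolling window, as described above.

```python
import re
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(text: str, threshold: float = 0.4) -> list[str]:
    # Naive sentence split; real chunkers use proper sentence segmentation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(sentences) < 2:
        return sentences
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(util.cos_sim(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:  # similarity drop -> likely topic boundary
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```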
Future Directions in Document Chunking
The future of document chunking lies in developing systems that can adaptively manage different chunk sizes and ensure semantic coherence. Emerging technologies, such as more powerful embedding models capable of handling longer contexts, promise to redefine traditional approaches. Moving beyond fixed sizes and using contextual information to improve chunk quality will yield better retrieval. As researchers dig deeper, the central question remains how to accurately retrieve the tokens pertinent to a specific user query, ensuring that retrieval is not only fast but also relevant.
Today we are back, continuing our series on search. We are talking to Brandon Smith about his work for Chroma. He led one of the largest studies in the field on different chunking techniques. So today we will look at how we can unfuck our RAG systems from badly chosen chunking hyperparameters.
The biggest lie in RAG is that semantic search is simple. The reality is that it's easy to build, it's easy to get up and running, but it's really hard to get right. And if you don't have a good setup, it's near impossible to debug. One of the reasons it's really hard is actually chunking. And there are a lot of things you can get wrong.
And even OpenAI bungled it a little bit, in my opinion, using an 800-token length for the chunks. And this might work for legal, where you have a lot of boilerplate that carries little semantic meaning, but often you have the opposite: very information-dense content. Imagine fitting an entire Wikipedia page into the size of a tweet. There will be a lot of information that's actually lost, and that's what happens with long chunks. The next is overlap. OpenAI uses a 400-token overlap, or used to. And what this does is try to bring the important context into the chunk, but in reality, we don't really know where the context is coming from.
It could be from a few pages prior, not just the 400 tokens before. It could also be from a definition that's not even in the document at all. There is a really interesting solution from Anthropic, Contextual Retrieval, where you basically preprocess all the chunks to see whether there is any missing information, and you try to reintroduce it.
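A rough sketch of that preprocessing idea, using the OpenAI Python SDK as a stand-in LLM client; the prompt wording and model name are illustrative assumptions, not Anthropic's published implementation.

```python
from openai import OpenAI

client = OpenAI()

def contextualize(chunk: str, document: str, model: str = "gpt-4o-mini") -> str:
    """Prepend LLM-generated document context to a chunk before embedding it."""
    prompt = (
        "Here is a document:\n" + document +
        "\n\nHere is a chunk from it:\n" + chunk +
        "\n\nWrite one or two sentences situating this chunk within the "
        "document, so the chunk can be understood on its own."
    )
    context = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return context + "\n\n" + chunk  # embed this instead of the bare chunk
```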
00:00 The Biggest Lie in RAG: Semantic Search Simplified
00:43 Challenges in Chunking and Overlap
01:38 Introducing Brandon Smith and His Research
02:05 The Motivation and Mechanics of Chunking
04:40 Issues with Current Chunking Methods
07:04 Optimizing Chunking Strategies
23:04 Introduction to Chunk Overlap
24:23 Exploring LLM-Based Chunking
24:56 Challenges with Initial Approaches
28:17 Alternative Chunking Methods
36:13 Language-Specific Considerations
38:41 Future Directions and Best Practices