Today we are back, continuing our series on search. We are talking to Brandon Smith about his work for Chroma. He led one of the largest studies in the field on different chunking techniques. So today we will look at how we can unfuck our RAG systems from badly chosen chunking hyperparameters.
The biggest lie in RAG is that semantic search is simple. The reality is that it's easy to build, it's easy to get up and running, but it's really hard to get right. And if you don't have a good setup, it's near impossible to debug. One of the reasons it's really hard is actually chunking. And there are a lot of things you can get wrong.
And even OpenAI botched it a little bit, in my opinion, using an 800 token length for the chunks. This might work for legal, where you have a lot of boilerplate that carries little semantic meaning, but often you have the opposite: very information-dense content. Imagine fitting an entire Wikipedia page into the size of a tweet. A lot of information is lost, and that's what happens with long chunks. The next is overlap. OpenAI uses a 400 token overlap, or used to. What this does is try to bring the important context into the chunk, but in reality, we don't really know where the context is coming from.
It could be from a few pages prior, not just the 400 tokens before. It could also be from a definition that's not even in the document at all. There is a really interesting solution, actually, from Anthropic: Contextual Retrieval, where you basically preprocess all the chunks to see whether there is any missing information and try to reintroduce it.
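To make the size and overlap numbers concrete, here is a minimal sketch of fixed-size chunking with overlap, using the 800 and 400 token values mentioned above as defaults. The function name `chunk_tokens` is hypothetical, and the token list is a stand-in: a real pipeline would get its tokens from the embedding model's own tokenizer. This is not OpenAI's implementation, just an illustration of the sliding window idea.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 800, overlap: int = 400) -> List[List[str]]:
    """Split tokens into fixed-size windows.

    Each window starts `chunk_size - overlap` tokens after the previous one,
    so it repeats the last `overlap` tokens of the window before it.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a trailing
        # chunk that only repeats already-covered tokens.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Placeholder tokens (assumption): stand-ins for real tokenizer output.
tokens = [f"tok{i}" for i in range(2000)]
chunks = chunk_tokens(tokens)
```

Note what the overlap can and cannot do here: every chunk only ever sees the 400 tokens directly before it. Context from a few pages earlier, or from outside the document, never makes it into the window, no matter how the two parameters are tuned.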
Brandon Smith:
Nicolay Gerold:
00:00 The Biggest Lie in RAG: Semantic Search Simplified
00:43 Challenges in Chunking and Overlap
01:38 Introducing Brandon Smith and His Research
02:05 The Motivation and Mechanics of Chunking
04:40 Issues with Current Chunking Methods
07:04 Optimizing Chunking Strategies
23:04 Introduction to Chunk Overlap
24:23 Exploring LLM-Based Chunking
24:56 Challenges with Initial Approaches
28:17 Alternative Chunking Methods
36:13 Language-Specific Considerations
38:41 Future Directions and Best Practices