How AI Is Built

#031 BM25 As The Workhorse Of Search; Vectors Are Its Visionary Cousin

Nov 15, 2024
David Tippett, a search engineer at GitHub with expertise in BM25 and OpenSearch, compares the efficiency of BM25 with that of vector search for information retrieval. He explains how BM25 refines search by factoring in user expectations and adapting to diverse queries. The conversation highlights the challenges of running vector search at scale, particularly on GitHub's massive dataset. David emphasizes that understanding user intent does more to improve search results than chasing cutting-edge technology.
INSIGHT

BM25 Is The Robust Default

  • BM25 is an efficient, robust retrieval function that works well out-of-domain and across many query types.
  • David Tippett calls BM25 the "OG" ranking function and emphasizes its broad applicability over vectors.
ADVICE

Start With Search Fundamentals

  • Learn BM25, inverted indexes, and core search internals before diving into vectors.
  • David Tippett advises mastering these fundamentals because they explain speed and limitations.
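An inverted index (sometimes loosely called a reverse index) maps each term to the documents that contain it, which is why keyword lookup is fast: a query touches only the posting lists for its own terms. The sketch below is a minimal illustration, not how OpenSearch or GitHub implement it. The function names, the whitespace tokenizer, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Naive whitespace tokenizer (assumption); real engines use analyzers.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all_terms(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample documents, for illustration only.
docs = [
    "fix null pointer in search indexer",
    "add vector search prototype",
    "update readme",
]
index = build_inverted_index(docs)
print(search_all_terms(index, "search indexer"))
print(search_all_terms(index, "indexing"))
```

The second query shows the limitation side: lookup is by exact term, so "indexing" finds nothing even though a document contains "indexer", unless the analysis step normalizes both words to the same token.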
INSIGHT

Why BM25 Improves TF-IDF

  • BM25 adds term saturation and document-length normalization to TF-IDF, so that repeated terms and long documents are not overweighted.
  • These adjustments make matches in short, focused documents more meaningful than in very long ones.
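The two adjustments can be seen in a per-term score. The sketch below assumes the textbook BM25 form with the commonly used defaults `k1=1.2` and `b=0.75`, and one common non-negative IDF variant. These choices, the function names, and the sample numbers are assumptions for illustration and were not specified in the episode.

```python
import math


def idf(n_docs, doc_freq):
    """One common non-negative IDF variant (assumption): rarer terms score higher."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_idf_weight(tf, idf_value):
    """Plain TF-IDF: grows linearly with every repeat of the term."""
    return tf * idf_value


def bm25_term_score(tf, doc_len, avg_doc_len, idf_value, k1=1.2, b=0.75):
    """BM25 score of one term in one document.

    k1 controls saturation: the score can never exceed idf_value * (k1 + 1).
    b controls length normalization: longer-than-average docs are penalized.
    """
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf_value * tf * (k1 + 1) / (tf + k1 * length_norm)


# Saturation: each extra repeat adds less under BM25 than under TF-IDF.
for tf in (1, 2, 5, 20):
    print(tf, tf_idf_weight(tf, 1.0), round(bm25_term_score(tf, 100, 100, 1.0), 3))

# Length normalization: same term count, short document versus long document.
print(bm25_term_score(3, 50, 100, 1.0) > bm25_term_score(3, 500, 100, 1.0))
```

With these assumptions, a term repeated 20 times scores only slightly more than one repeated 5 times, and three matches in a 50-token document outrank three matches in a 500-token one.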