The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Generative Benchmarking with Kelly Hong - #728

Apr 23, 2025
Kelly Hong, a researcher at Chroma, delves into generative benchmarking, a vital approach for evaluating retrieval systems with synthetic data. She critiques traditional benchmarks for failing to mimic real-world queries, stressing the importance of aligning LLM judges with human preferences. Kelly explains a two-step process: filtering relevant documents and generating user-like queries to enhance AI performance. The discussion also covers the nuances of chunking strategies and the differences between benchmark and real-world queries, advocating for a more systematic approach to AI evaluation.
AI Snips
INSIGHT

Limitations of Public Benchmarks

  • Public benchmarks like MTEB do not represent real-world queries or messy production data well.
  • Embedding models that perform best on benchmarks can underperform in actual applications.
ADVICE

Start Evaluating Early and Simply

  • Use easy, approachable tools like generative benchmarking notebooks to start systematically evaluating RAG systems; a minimal retrieval-evaluation sketch follows this list.
  • Getting familiar with evals early improves your ability to debug and improve AI system performance.
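To make "systematic eval" concrete, here is a minimal sketch of a first retrieval metric, recall@k over a labeled query set. It assumes you already have (query, relevant document ID) pairs and a retrieval function to plug in; the names and the stub retriever are illustrative placeholders, not part of any specific library.

```python
from typing import Callable

def recall_at_k(
    queries: dict[str, set[str]],               # query text -> set of relevant doc IDs (gold labels)
    retrieve: Callable[[str, int], list[str]],  # your retrieval system: (query, k) -> ranked doc IDs
    k: int = 10,
) -> float:
    """Fraction of queries whose gold documents appear in the top-k retrieved results."""
    if not queries:
        return 0.0
    hits = 0
    for query, relevant_ids in queries.items():
        retrieved = set(retrieve(query, k))
        if relevant_ids & retrieved:
            hits += 1
    return hits / len(queries)

if __name__ == "__main__":
    # Stub data and retriever for illustration; replace with your RAG system's retrieval call.
    gold = {"how do I reset my password?": {"doc_42"}}
    stub_retrieve = lambda q, k: ["doc_7", "doc_42", "doc_13"][:k]
    print(f"recall@10: {recall_at_k(gold, stub_retrieve, k=10):.2f}")
```

Even a loop this small, run regularly against a fixed query set, gives you a baseline to compare against when you change chunking, embeddings, or prompts.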
INSIGHT

Two-step Generative Benchmarking Process

  • Filtering documents before query generation ensures the evaluation focuses on content users actually ask about.
  • Query generation then models realistic, often vague user queries to improve benchmark relevance; see the sketch after this list.
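A minimal sketch of that two-step flow: first filter document chunks an LLM judge deems worth asking about, then generate user-style queries from the surviving chunks. The `llm` helper is a hypothetical placeholder for whatever completion client you use, and the prompts are illustrative rather than Chroma's actual implementation.

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to your LLM provider; wire this to a real client."""
    raise NotImplementedError

def filter_chunks(chunks: list[str], criteria: str) -> list[str]:
    """Step 1: keep only chunks that users would plausibly ask about, per aligned criteria."""
    kept = []
    for chunk in chunks:
        verdict = llm(
            f"Criteria: {criteria}\n\n"
            f"Document chunk:\n{chunk}\n\n"
            "Would a real user plausibly ask a question answered by this chunk? "
            "Reply with exactly YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            kept.append(chunk)
    return kept

def generate_queries(chunks: list[str], example_queries: list[str]) -> list[tuple[str, str]]:
    """Step 2: generate a realistic, often vague user-style query for each retained chunk."""
    examples = "\n".join(f"- {q}" for q in example_queries)
    pairs = []
    for chunk in chunks:
        query = llm(
            "Here are examples of how real users phrase queries (often short and vague):\n"
            f"{examples}\n\n"
            f"Write one query in the same style that this chunk would answer:\n{chunk}"
        )
        pairs.append((query.strip(), chunk))
    return pairs  # (query, gold chunk) pairs form the generated benchmark
```

The resulting query/chunk pairs can feed directly into a recall@k loop like the one above, so the benchmark reflects your own corpus and your users' query style rather than a public dataset's.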