Kelly Hong, a researcher at Chroma, delves into generative benchmarking, an approach for evaluating retrieval systems with synthetic data. She critiques traditional benchmarks for failing to mimic real-world queries and stresses the importance of aligning LLM judges with human preferences. Kelly explains a two-step process: filtering documents down to relevant content and generating user-like queries to build representative evaluation sets. The discussion also covers the nuances of chunking strategies and the differences between benchmark and production queries, advocating for more systematic AI evaluation.
Generative Benchmarking enhances AI evaluation by tailoring benchmarks through document filtering and realistic query generation to reflect real-world usage.
Aligning AI evaluations with human preferences is essential; iteratively incorporating human feedback into LLM judges yields more accurate assessments.
Production queries differ from benchmark queries in style and ambiguity, and chunking strategies further shape retrieval effectiveness; both gaps affect how well an evaluation reflects real user interactions.
Deep dives
Challenges in Building Robust AI Systems
Creating AI applications that perform well in real-world scenarios is complex. Developers often struggle to transition from impressive demos in controlled environments to systems that handle unpredictable inputs effectively. This challenge stems from the difficulty of evaluating AI systems whose outputs are probabilistic rather than deterministic. A robust evaluation program is essential for ensuring that AI applications deliver consistent and meaningful value to users.
Generative Benchmarking as a Solution
Generative benchmarking addresses the evaluation challenges faced by AI developers by offering a systematic way to create custom evaluation sets from a team's own data. The process involves two key steps: document filtering, to ensure that only relevant content is considered, and query generation, to create realistic test queries representative of actual user interactions. This lets developers build benchmarks that align closely with their specific use cases, moving beyond generic public benchmarks that may not reflect real-world conditions. Chroma also provides tooling to make the approach straightforward to adopt.
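Once the filtered documents and generated queries exist, the benchmark itself reduces to a retrieval check: embed the documents, run each generated query, and see whether the source chunk comes back in the top results. Here is a minimal sketch using Chroma's Python client; the helper name and data shapes are assumptions for illustration, not Chroma's official tooling.

```python
import chromadb

def recall_at_k(collection, eval_pairs, k=5):
    """eval_pairs: (query_text, source_doc_id) tuples from the generated benchmark."""
    hits = 0
    for query, doc_id in eval_pairs:
        results = collection.query(query_texts=[query], n_results=k)
        if doc_id in results["ids"][0]:  # top-k ids returned for this query
            hits += 1
    return hits / len(eval_pairs)

client = chromadb.Client()
collection = client.create_collection("benchmark_docs")
# collection.add(ids=[...], documents=[...])   # filtered chunks (step 1)
# print(recall_at_k(collection, eval_pairs))   # generated queries (step 2)
```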
The Importance of Document Filtering
Document filtering plays a critical role in generative benchmarking, ensuring that irrelevant content is excluded from evaluation sets. By focusing on what users will realistically ask, the filtering process increases the relevancy and quality of the generated queries. This step is crucial in differentiating generative benchmarking from naive query generation approaches, which may include irrelevant or outdated data. Ultimately, a tailored filtering process enhances the representativeness of the evaluation, leading to more accurate assessments of system performance.
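As a concrete illustration, document filtering can be as simple as asking an LLM judge whether each chunk could plausibly answer a real user question. This is a minimal sketch assuming an OpenAI-style chat client; the prompt wording and model choice are placeholders, not Chroma's exact implementation.

```python
from openai import OpenAI

client = OpenAI()

FILTER_PROMPT = """You are screening documentation chunks for a support assistant.
Answer YES if a real user question could plausibly be answered from the chunk,
or NO if it is boilerplate, navigation text, or otherwise irrelevant.

Chunk:
{chunk}

Answer YES or NO only."""

def keep_chunk(chunk: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": FILTER_PROMPT.format(chunk=chunk)}],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

def filter_documents(chunks: dict[str, str]) -> dict[str, str]:
    """Keep only chunks the judge considers answerable by realistic user queries."""
    return {doc_id: text for doc_id, text in chunks.items() if keep_chunk(text)}
```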
User-Centric Query Generation
Query generation in generative benchmarking is designed to reflect the types of inquiries users typically make, emphasizing the style and context of real-world interactions. By integrating contextual parameters and example queries from actual user data, the approach steers the model to produce queries that align with how users engage with AI systems. This specificity helps developers assess their systems more accurately, ensuring that generated queries are not just formulaically crafted but relevant to user needs. The outcome is a set of queries that provide a practical basis for evaluating the system’s performance in a production environment.
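A sketch of what steered query generation can look like, again assuming an OpenAI-style client; the context description and example queries would come from your own product and production logs.

```python
from openai import OpenAI

client = OpenAI()

QUERY_PROMPT = """You are simulating users of {context}.

Real users phrase questions like this:
{examples}

Write one question a real user might ask that can be answered from the document
below. Match the style of the examples (often short, informal, or slightly vague).

Document:
{chunk}

Question:"""

def generate_query(chunk: str, context: str, example_queries: list[str]) -> str:
    prompt = QUERY_PROMPT.format(
        context=context,
        examples="\n".join(f"- {q}" for q in example_queries),
        chunk=chunk,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: swap in your preferred model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```

Pairing each generated query with the chunk it was generated from yields the (query, relevant document) pairs used in the recall check sketched earlier.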
Human Alignment in Evaluation Processes
The generative benchmarking process emphasizes human alignment to ensure that AI judgments are consistent with human expectations. Incorporating human feedback into the LLM judge's evaluation criteria makes its assessments more accurate. Initial tests showed low alignment scores, which improved significantly after iterative refinements based on human input. This highlights the need for a human-in-the-loop process to produce reliable evaluations that genuinely reflect user preferences and practical use.
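Measuring alignment is straightforward once a small sample has been hand-labeled: compare the LLM judge's labels against the human labels and compute the agreement rate, refining the judge's criteria until agreement is acceptable. A minimal sketch, with the data shapes as assumptions:

```python
def alignment_rate(judge_labels: dict[str, bool], human_labels: dict[str, bool]) -> float:
    """Fraction of shared items on which the LLM judge agrees with the human label."""
    shared = judge_labels.keys() & human_labels.keys()
    if not shared:
        return 0.0
    agreements = sum(judge_labels[item] == human_labels[item] for item in shared)
    return agreements / len(shared)

# Typical loop: hand-label a small sample of judgments, measure agreement,
# tighten the judge prompt's criteria based on the disagreements, re-measure.
```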
In this episode, Kelly Hong, a researcher at Chroma, joins us to discuss "Generative Benchmarking," a novel approach to evaluating retrieval systems, like RAG applications, using synthetic data. Kelly explains how traditional benchmarks like MTEB fail to represent real-world query patterns and how embedding models that perform well on public benchmarks often underperform in production. The conversation explores the two-step process of Generative Benchmarking: filtering documents to focus on relevant content and generating queries that mimic actual user behavior. Kelly shares insights from applying this approach to Weights & Biases' technical support bot, revealing how domain-specific evaluation provides more accurate assessments of embedding model performance. We also discuss the importance of aligning LLM judges with human preferences, the impact of chunking strategies on retrieval effectiveness, and how production queries differ from benchmark queries in ambiguity and style. Throughout the episode, Kelly emphasizes the need for systematic evaluation approaches that go beyond "vibe checks" to help developers build more effective RAG applications.