

Generative Benchmarking with Kelly Hong - #728
Apr 23, 2025
Kelly Hong, a researcher at Chroma, delves into generative benchmarking, an approach to evaluating retrieval systems with synthetic data. She critiques traditional benchmarks for failing to reflect real-world queries and stresses the importance of aligning LLM judges with human preferences. Kelly explains the two-step process behind generative benchmarking: filtering documents for relevance, then generating realistic, user-like queries to test retrieval. The discussion also covers the nuances of chunking strategies and the differences between benchmark and real-world queries, advocating for more systematic AI evaluation.
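One way to make the judge-alignment point concrete is to compare an LLM judge's relevance verdicts against human labels on the same query-document pairs and track the agreement rate. A minimal sketch, assuming a hypothetical `judge` callable rather than the exact setup discussed in the episode:

```python
from typing import Callable, List, Tuple

def judge_human_agreement(
    judge: Callable[[str, str], bool],     # hypothetical LLM judge: (query, doc) -> relevant?
    labeled: List[Tuple[str, str, bool]],  # (query, doc, human relevance label)
) -> float:
    """Fraction of (query, doc) pairs where the LLM judge matches the human label."""
    if not labeled:
        return 0.0
    matches = sum(judge(query, doc) == human for query, doc, human in labeled)
    return matches / len(labeled)
```

If agreement is low, the judge prompt (or its few-shot examples) is iterated on until its verdicts track human preferences, before the judge is trusted at scale.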
AI Snips
Limitations of Public Benchmarks
- Public benchmarks like MTEB do not represent real-world queries or messy production data well.
- Embedding models that perform best on benchmarks can underperform in actual applications.
Start Evaluating Early and Simply
- Use easy, approachable tools like generative benchmarking notebooks to start systematically evaluating RAG systems (a minimal eval-loop sketch follows this snip).
- Getting familiar with evals early improves your ability to debug and improve AI system performance.
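As a starting point for evaluating early and simply, a tiny recall@k loop over a handful of hand-labeled queries already gives useful signal. This is a minimal sketch, not the notebook mentioned above; `toy_retrieve` is a stand-in for your actual retrieval step (for example, a vector store query):

```python
from typing import Callable, Dict, List

def recall_at_k(
    retrieve: Callable[[str, int], List[str]],  # (query, k) -> ranked doc ids
    labeled_queries: Dict[str, str],            # query -> id of its relevant doc
    k: int = 5,
) -> float:
    """Fraction of queries whose relevant document appears in the top-k results."""
    hits = sum(1 for q, doc_id in labeled_queries.items() if doc_id in retrieve(q, k))
    return hits / max(len(labeled_queries), 1)

# Toy usage: a keyword-overlap "retriever" over two documents.
docs = {"d1": "how to reset your password", "d2": "billing and invoices"}

def toy_retrieve(query: str, k: int) -> List[str]:
    ranked = sorted(docs, key=lambda d: len(set(query.lower().split()) & set(docs[d].split())), reverse=True)
    return ranked[:k]

print(recall_at_k(toy_retrieve, {"reset password": "d1", "billing invoices": "d2"}, k=1))  # 1.0
```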
Two-step Generative Benchmarking Process
- Filtering documents before query generation ensures evaluation focuses on content users really ask about.
- Query generation then models realistic, often vague user queries to improve benchmark relevance (see the sketch below).
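A minimal sketch of that two-step flow, assuming a hypothetical `call_llm(prompt) -> str` helper (swap in whatever LLM client you use); the prompts and the YES/NO heuristic are illustrative, not Chroma's exact implementation:

```python
from typing import Callable, Dict, List, Tuple

def filter_documents(docs: Dict[str, str], call_llm: Callable[[str], str]) -> Dict[str, str]:
    """Step 1: keep only documents a real user would plausibly ask about."""
    kept = {}
    for doc_id, text in docs.items():
        verdict = call_llm(
            "Would an end user realistically ask a question that this document answers? "
            f"Reply YES or NO.\n\n{text}"
        )
        if verdict.strip().upper().startswith("YES"):
            kept[doc_id] = text
    return kept

def generate_queries(docs: Dict[str, str], call_llm: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Step 2: generate one short, realistically vague user query per kept document."""
    pairs = []
    for doc_id, text in docs.items():
        query = call_llm(
            "Write one short, casual question a user might type into a search box "
            f"that this document answers. Keep it vague, like a real query.\n\n{text}"
        )
        pairs.append((query.strip(), doc_id))
    return pairs
```

The resulting (query, document id) pairs double as ground truth for a retrieval eval such as the recall@k loop sketched earlier.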