Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge

Latent Space: The AI Engineer Podcast

Quality Matters: Recognizing Benchmark Issues

Older benchmarks were used for marketing and promotion despite known issues; those issues are only now drawing attention, as higher scores invite closer scrutiny. Evaluation papers reveal that datasets were created by underpaid workers with limited English proficiency, leaving evident errors in the data. The need to manually verify large numbers of samples after the fact shows why benchmarks should be built to a high standard from the beginning.
