Super Data Science: ML & AI Podcast with Jon Krohn

903: LLM Benchmarks Are Lying to You (And What to Do Instead), with Sinan Ozdemir

Jul 8, 2025
Sinan Ozdemir, Founder and CTO of Loop Genius and author of 'The Quick Start Guide to Large Language Models', dives deep into AI benchmarking's shortcomings. He discusses how transparency in training data is often compromised and argues for human-led quality checks to curb AI hallucinations. Sinan criticizes existing benchmarks, calling for more tailored evaluations and domain-specific measures. He also touches on the evolution of language models and the future of AI assessment, prompting listeners to rethink what's truly effective in AI development.
INSIGHT

Why AI Benchmarks Don't Tell the Whole Truth and What You Should Do Instead

AI benchmarks like MMLU and Humanity's Last Exam often mislead because labs "teach to the test," fine-tuning models to excel on these benchmarks rather than for real-world tasks. Benchmarks are also easily "contaminated" when test questions leak into training data, making it impossible to verify true performance without transparency on training datasets.

Sinan Ozdemir stresses that organizations should build their own domain-specific test sets, tailored to their applications, rather than relying solely on public benchmarks. He advocates human-verified, rubric-based evaluation, potentially augmented by LLM judges that score outputs against the rubric (a minimal sketch follows below). This lets teams maintain internal leaderboards that reflect actual business needs rather than chasing generic scores.
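
As a concrete illustration of the rubric-plus-LLM-judge idea, here is a minimal Python sketch. It is not from the episode: the rubric wording, the EvalCase fields, and call_judge_llm are hypothetical placeholders for whatever rubric and provider API you actually use, and the canned "4" return value exists only so the sketch runs end to end. Judge scores like these are meant to be spot-checked by human reviewers before they drive any decision.

from dataclasses import dataclass

RUBRIC = """Score the answer from 1 (unusable) to 5 (excellent) on:
- factual accuracy (no hallucinated claims)
- relevance to the question
- completeness for our domain
Return only the integer score."""

@dataclass
class EvalCase:
    question: str
    reference_answer: str  # human-verified gold answer

def call_judge_llm(prompt: str) -> str:
    # Placeholder for your LLM provider's chat/completions call (an
    # assumption, not a real API). Returns a canned score so the sketch runs.
    return "4"

def judge(case: EvalCase, model_answer: str) -> int:
    # Build a judging prompt from the rubric, the human-verified reference,
    # and the candidate answer, then parse the judge's integer score.
    prompt = (
        f"{RUBRIC}\n\n"
        f"Question: {case.question}\n"
        f"Reference answer: {case.reference_answer}\n"
        f"Candidate answer: {model_answer}\n"
        f"Score:"
    )
    return int(call_judge_llm(prompt).strip())

if __name__ == "__main__":
    case = EvalCase(
        question="What is the refund window for annual plans?",
        reference_answer="30 days from the purchase date.",
    )
    print(judge(case, "Refunds are accepted within 30 days of purchase."))

Keeping the rubric explicit and the reference answers human-verified is what distinguishes this from simply trusting a public benchmark score.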

Sinan also highlights that models can hallucinate (generate falsehoods) up to 40% of the time on factual benchmarks, showing that high benchmark scores do not guarantee truthfulness or reliability. The future of evaluation involves mixed methods, including decontamination of training data and continuous testing of models in real-world conditions.

INSIGHT

Benchmarks Can Mislead Practitioners

  • Benchmarks are useful as conversation starters but often lead to teaching to the test or data contamination.
  • Many questions in benchmarks, like trivia in Humanity's Last Exam, don't reflect practical business tasks.
ADVICE

Create Custom AI Test Sets

  • Build custom test sets specific to your AI application to evaluate model suitability.
  • Use internal leaderboards to foster continuous improvement on your domain-specific tasks (see the leaderboard sketch after this list).
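
Below is a minimal Python sketch of what an internal leaderboard over a custom test set might look like. Everything in it is a stand-in rather than anything described in the episode: the test cases, the model names, and the token-overlap score_answer function are hypothetical, and in practice you would replace the scorer with the rubric/LLM-judge evaluation sketched earlier.

import json
from statistics import mean

TEST_SET = [  # in practice, load your own curated, human-verified cases from a file
    {"question": "What SLA applies to enterprise support tickets?",
     "reference": "First response within 4 business hours."},
    {"question": "Which regions does the EU data-residency plan cover?",
     "reference": "All EU member states plus Norway and Iceland."},
]

def score_answer(reference: str, answer: str) -> float:
    # Stand-in scorer: crude token overlap with the reference answer.
    # Replace with your rubric-based or LLM-judge evaluation.
    overlap = set(reference.lower().split()) & set(answer.lower().split())
    return len(overlap) / max(len(reference.split()), 1)

def run_leaderboard(model_outputs: dict[str, list[str]]) -> list[tuple[str, float]]:
    # model_outputs maps model name -> one answer per test case, in order.
    board = []
    for model, answers in model_outputs.items():
        scores = [score_answer(case["reference"], ans)
                  for case, ans in zip(TEST_SET, answers)]
        board.append((model, mean(scores)))
    return sorted(board, key=lambda row: row[1], reverse=True)

if __name__ == "__main__":
    outputs = {
        "model-a": ["First response within 4 business hours.",
                    "EU member states plus Norway and Iceland."],
        "model-b": ["We respond eventually.", "Only Germany."],
    }
    print(json.dumps(run_leaderboard(outputs), indent=2))

Re-running this harness whenever a new model or prompt variant is considered gives a domain-specific ranking to improve against, rather than chasing generic public benchmark scores.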