

903: LLM Benchmarks Are Lying to You (And What to Do Instead), with Sinan Ozdemir
Why AI Benchmarks Don't Tell the Whole Truth and What You Should Do Instead
AI benchmarks like MMLU and Humanity's Last Exam often mislead because labs "teach to the test," fine-tuning models to excel on these benchmarks rather than for real-world tasks. Benchmarks are also easily "contaminated" when test questions leak into training data, making it impossible to verify true performance without transparency on training datasets.
Sinan Ozdemir stresses that organizations should create their own domain-specific test sets tailored to their applications rather than relying solely on public benchmarks. He advocates human-verified, rubric-based evaluation, potentially augmented by LLMs acting as judges of model outputs. This approach enables teams to build internal leaderboards that reflect actual business needs rather than chasing generic scores.
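To make the rubric idea concrete, here is a minimal Python sketch of LLM-assisted, rubric-based grading. The rubric criteria, the prompt wording, and the `call_llm` hook are illustrative assumptions rather than the specific setup discussed in the episode; `call_llm` stands in for whatever function wraps your provider's chat API and returns the completion text.

```python
import json

# Rubric criteria for judging a model answer; adapt these to your domain.
RUBRIC = {
    "factual_accuracy": "Are all claims in the answer verifiably correct?",
    "completeness": "Does the answer address every part of the question?",
    "groundedness": "Is the answer supported by the provided context, with no invented details?",
}

JUDGE_PROMPT = """You are grading a model's answer against a rubric.
Question: {question}
Reference answer (human-verified): {reference}
Model answer: {answer}

Score each criterion from 1 (poor) to 5 (excellent) and return JSON like
{{"factual_accuracy": 4, "completeness": 5, "groundedness": 3}}.
Criteria:
{criteria}"""


def judge_answer(question: str, reference: str, answer: str, call_llm) -> dict:
    """Ask a judge LLM to score one answer against the rubric.

    `call_llm` is a placeholder: pass in whatever function calls your
    provider's chat API with a prompt and returns the raw completion text.
    """
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in RUBRIC.items())
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer, criteria=criteria
    )
    scores = json.loads(call_llm(prompt))
    # Keep only the criteria we asked for, so a chatty judge can't add extras.
    return {name: int(scores[name]) for name in RUBRIC}
```

Human reviewers can spot-check a sample of these judge scores to keep the "human-verified" part of the loop honest.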
Sinan also highlights that models can hallucinate (generate falsehoods) up to 40% of the time on factual benchmarks, showing that high benchmark scores do not guarantee truthfulness or reliability. The future of evaluation involves mixed methods, including decontamination of training data and continuous testing of models in real-world conditions.
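Decontamination is often approximated with simple overlap heuristics. The sketch below flags test items that share long word n-grams with a reference corpus; the 8-gram threshold and the function names are assumptions for illustration, not a method prescribed in the episode.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Lowercased word n-grams; 8-grams are a common contamination heuristic."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def flag_contaminated(test_items: list[str], corpus_docs: list[str], n: int = 8) -> list[int]:
    """Return indices of test items that share any n-gram with the corpus.

    A coarse check: any shared 8-gram suggests the test question (or its
    source text) may have appeared in the training or reference corpus.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(test_items) if ngrams(item, n) & corpus_grams]
```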
Benchmarks Can Mislead Practitioners
- Benchmarks are useful as conversation starters, but they invite teaching to the test and are prone to data contamination.
- Many benchmark questions, such as the trivia-style items in Humanity's Last Exam, don't reflect practical business tasks.
Create Custom AI Test Sets
- Build custom test sets specific to your AI application to evaluate model suitability.
- Use internal leaderboards to foster continuous improvement on your domain-specific tasks (see the sketch after this list).
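Below is a minimal sketch of an internal leaderboard built on a custom test set. The JSONL format, the `score_fn` hook (for example, the mean of rubric scores from a judge LLM or a human reviewer), and the toy models are assumptions for illustration, not a prescribed pipeline.

```python
import json
from statistics import mean


def load_test_set(path: str) -> list[dict]:
    """Load a domain-specific test set stored as JSONL with
    {"question": ..., "reference": ...} per line (format is an assumption)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def run_leaderboard(test_set: list[dict], models: dict, score_fn) -> list[tuple]:
    """Evaluate each candidate model on the same test set and rank by mean score.

    `models` maps a model name to a callable question -> answer;
    `score_fn(question, reference, answer)` returns a float, e.g. the mean
    of the rubric scores produced by a judge LLM or a human reviewer.
    """
    results = []
    for name, generate in models.items():
        scores = [
            score_fn(item["question"], item["reference"], generate(item["question"]))
            for item in test_set
        ]
        results.append((name, mean(scores)))
    # Highest mean score first: this ordering is your internal leaderboard.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; swap in your real test set,
    # candidate models, and a judge-based score_fn.
    test_set = [{"question": "What is our refund window?", "reference": "30 days"}]
    models = {"model_a": lambda q: "30 days", "model_b": lambda q: "two weeks"}
    score_fn = lambda q, ref, ans: 5.0 if ref.lower() in ans.lower() else 1.0
    for name, score in run_leaderboard(test_set, models, score_fn):
        print(f"{name}: {score:.2f}")
```

Re-running this harness whenever a new model or prompt is proposed gives the team a domain-specific score to track instead of a public benchmark number.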