How we tested models with 'None of the Above' substitutions

Suana details the experimental setup: selecting 100 MedQA items, clinician review, and retaining 68 NOTA-valid questions.

Episode notes

Is AI really ready to replace doctors? Stanford PhD researcher Suana reveals shocking truths about medical AI that Big Tech doesn't want you to know. When she tested leading AI models like GPT-4, Claude, and DeepSeek on modified medical questions, their accuracy plummeted by up to 40%!

In this eye-opening conversation, we dive deep into:

❌ Why 95%+ accuracy on medical exams means nothing in real clinical practice

❌ How AI models fail when there's "no right answer" (which happens constantly in medicine)

❌ The dangerous gap between flashy headlines and clinical reality

✅ How doctors can safely use AI as a co-pilot (not replacement)

✅ The future of medical AI evaluation and what needs to change

Suana is a 3rd-year PhD student at Stanford in Biomedical Data Science, pioneering real-world evaluation methods for medical AI. Her research on MedELM and benchmarking is reshaping how we think about AI deployment in healthcare.

🔬 Key Research Discussed:

JAMA Network Open publication on AI robustness in medical diagnosis

MedELM: 35-dataset benchmark suite for real clinical tasks

Why MedQA and USMLE-style tests don't reflect actual patient care

⚠️ CRITICAL TAKEAWAY: AI models are trained to always give an answer, even when "none of the above" is correct, a potentially dangerous flaw in medical decision-making.

📚 Resources Mentioned:

MedELM Leaderboard (public repository available)

Research on medical AI evaluation standards

Real-world hospital deployment considerations

Timestamps:

0:00 - Introduction: Why Medical AI Evaluation is Broken

1:04 - Suana's Journey: From Computer Science to Healthcare AI

2:32 - The 3 Critical Problems with Current AI Benchmarks

8:28 - The Research: Testing AI with "None of the Above"

17:24 - Shocking Results: AI Accuracy Drops 8% to 40%

19:02 - Why AI Can't Say "I Don't Know"

23:10 - Take-Home Message: Use AI as Co-Pilot, Not Replacement

24:58 - Real Clinical Examples: When AI Actually Helps

28:12 - MedELM: The Future of Medical AI Evaluation

34:35 - Final Advice for Doctors, Patients & Developers

Whether you're a physician, healthcare worker, AI developer, or patient curious about medical AI, this conversation will change how you think about artificial intelligence in healthcare.

Paper link: https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2837372