“Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals” by Alexa Pan, ryan_greenblatt

LessWrong (Curated & Popular)

Why Hiding Evidence of Misalignment Is Dangerous

Alexa and Ryan explain how training that suppresses warning signs undermines our ability to detect and iterate on misalignment risks.

Segment starts at 22:58 in the episode.
