
"When can we trust model evaluations?" by evhub

LessWrong (Curated & Popular)


Behavioral Non-Fine-Tuning Evaluations

This style of evaluation is very easy for the model to game. Since there's no training process involved in these evaluations, a model that knows it's being evaluated can just pick whatever answer it wants. For behavioral non-fine-tuning evaluations to be trustworthy, you have to believe that the model isn't trying to game your evaluations. Any sort of situational awareness on the model's part could lead to evaluation gaming.

