"When can we trust model evaluations?" bu evhub

The Importance of RL in Exploration Hacking

RL might still be quite helpful here as a form of bootstrapping. If you can only generate weak examples, but want to see whether the model is capable of generating stronger examples, you could first supervised fine-tune on the weak examples, then RL fine-tune to try to get the model to produce stronger examples. Unfortunately, especially in the cases where you might want to use this style of evaluation as an alignment evaluation rather than a capabilities evaluation, telling the difference between cases where humans can demonstrate the behavior and cases where they can't could be quite difficult.
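
To make the bootstrapping idea concrete, here is a minimal, self-contained PyTorch sketch: a toy policy is first supervised fine-tuned on weak demonstrations, then RL fine-tuned with plain REINFORCE against a grader that rewards stronger behavior. The toy task, reward function, and policy are illustrative assumptions, not the setup from the post.

```python
# Minimal sketch of SFT-then-RL bootstrapping on a toy task (assumptions throughout):
# "strength" of a behavior is an action in 0..9, weak demos sit around 4-5,
# and RL is used to check whether the policy can be pushed toward stronger outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

NUM_ACTIONS = 10  # strength levels 0..9; higher = stronger behavior
policy = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, NUM_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

# --- Phase 1: supervised fine-tuning on weak examples (strength ~4-5) ---
weak_demos = torch.tensor([4, 5, 4, 5, 5, 4] * 50)   # weak demonstrations
inputs = torch.ones(len(weak_demos), 1)              # dummy fixed "prompt"
for _ in range(200):
    loss = F.cross_entropy(policy(inputs), weak_demos)
    opt.zero_grad()
    loss.backward()
    opt.step()

# --- Phase 2: RL fine-tuning (REINFORCE) toward stronger behavior ---
def reward(actions):
    # Hypothetical grader: higher strength earns higher reward.
    return actions.float() / (NUM_ACTIONS - 1)

baseline = 0.0
for _ in range(500):
    dist = torch.distributions.Categorical(logits=policy(torch.ones(64, 1)))
    actions = dist.sample()
    r = reward(actions)
    baseline = 0.9 * baseline + 0.1 * r.mean().item()  # moving-average baseline
    loss = -((r - baseline) * dist.log_prob(actions)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Evaluate how far RL moved the policy beyond the weak demonstrations.
final = torch.distributions.Categorical(logits=policy(torch.ones(1000, 1))).sample()
print("mean strength after SFT + RL:", final.float().mean().item())
```

If RL pushes the policy well past the weak demonstrations, that is some evidence the stronger capability was latent; if it stays near them, either the capability is absent or, in the worrying case, the model is exploration hacking.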
