
"When can we trust model evaluations?" bu evhub
LessWrong (Curated & Popular)
The Importance of RL in Exploration Hacking
RL might still be quite helpful here as a form of bootstrapping. If you can only generate weak examples, but want to see whether the model is capable of generating stronger examples, you could first supervised fine-tune on the weak examples, then RL fine-tune to try to get the model to produce stronger examples. Unfortunately, especially in the cases where you might want to use this style of evaluation as an alignment evaluation rather than a capabilities evaluation, telling the difference between cases where humans can demonstrate the behavior and cases where they can't can be quite tricky.
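As a concrete sketch of what this bootstrapping could look like (this is not code from the original post), here is a minimal Python outline. All of the helpers passed in (`load_base_model`, `sft_finetune`, `rl_finetune`, `sample`, `grade`) are hypothetical placeholders for whatever fine-tuning and grading stack you actually use; the point is only the structure: SFT on weak demonstrations, then RL against a strength-grading reward, then compare before/after scores.

```python
# Minimal sketch of the SFT-then-RL bootstrapping evaluation described above.
# Every helper passed in is a hypothetical placeholder, not a real library API.

from typing import Callable, Dict, Sequence, Tuple


def bootstrapped_elicitation_eval(
    load_base_model: Callable[[], object],
    sft_finetune: Callable[[object, Sequence[Tuple[str, str]]], object],
    rl_finetune: Callable[[object, Callable[[str, str], float], Sequence[str]], object],
    sample: Callable[[object, str], str],
    grade: Callable[[str, str], float],        # higher = stronger instance of the behavior
    weak_demos: Sequence[Tuple[str, str]],     # (prompt, weak response) pairs humans can write
    prompts: Sequence[str],                    # held-out prompts used for scoring
) -> Dict[str, float]:
    """Estimate whether the model *can* produce strong examples of a behavior
    when humans can only demonstrate weak ones: SFT on the weak demos, then
    RL against a strength-grading reward, and compare before/after scores."""

    def mean_strength(model: object) -> float:
        # Average graded strength of the model's responses on held-out prompts.
        return sum(grade(p, sample(model, p)) for p in prompts) / len(prompts)

    # Baseline: behavior strength with no elicitation at all.
    baseline = mean_strength(load_base_model())

    # Stage 1: supervised fine-tuning on the weak examples humans *can* write.
    # This seeds the behavior, so RL doesn't depend on the model choosing to
    # explore into it on its own (the worry behind exploration hacking).
    sft_model = sft_finetune(load_base_model(), weak_demos)
    after_sft = mean_strength(sft_model)

    # Stage 2: RL with the grader as reward, trying to push the behavior past
    # the weak level that was demonstrated.
    rl_model = rl_finetune(sft_model, grade, prompts)
    after_rl = mean_strength(rl_model)

    # A large gap between after_rl and baseline is evidence the stronger
    # capability was latent even though humans could only supply weak examples.
    return {
        "baseline": baseline,
        "after_sft": after_sft,
        "after_sft_then_rl": after_rl,
    }
```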


