"When can we trust model evaluations?" by evhub

LessWrong (Curated & Popular)

Gradient Hacking and the Superhuman Capability Regime

Gradient hacking is a way in which a model could fail to be successfully fine-tuned on some task. We can probably fully trust evaluations in this category until we are well into the superhuman capability regime. Current models are, for the most part, not yet capable enough to cause catastrophe even if they wanted to, so right now we can mostly use capabilities evaluations to gauge model safety. But as models become more capable, that will cease to be true, and we will need a mechanism for robustly evaluating alignment.
