Is It Robustly Optimizing the Objectives?

In the paper we use various forms of classifiers, like an offensive language classifier to detect if the model is generating offensive text. But that classifire, it could also detect for other things such as power seeking behaviour or malicious code generation. So you just need some sort of system that catches the kind of failure. Am, what is it a kind of long term way of doing it? Because am, for now just like have this language modelled that is spitting some offensive content or biases. Could it be possible just like throws some like aliment task? And i then needs to figure out something robustlic, how do you see tes kind of approach scaling ito

Transcript

Play full episode

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app