Ethan Perez – Inverse Scaling, Language Feedback, Red Teaming

The Inside View

Using a Language Model to Train a Red Teaming Team

The idea is that if you train it adversarially, then it might overfit to that kind of adversarial opponent. And you want to have something as well at test time, like some kind of big red teaming team. Yeah, I think that's part of the concern: if you do have this model that has picked up an alternative inner objective, and it cares about something that's different from what we want, then that model will still want to fail on other kinds of inputs. That doesn't constrain the model's behaviour at all on other inputs that you wouldn't generate during this red teaming process.

