Using a Language Model to Train a Red Teaming Team

The idea is that if you train it adversarially, then it might overfit to that kind of adversarial opponent. And you want to have something as well, at test time, like some kind of a big red teaming team. Yeah, I think that part of the concern is that if you do have this model that has picked up an alternative inner objective, and it cares about something else that's different from what we want, then that model will still want to fail on other kinds of inputs. That doesn't restrain the model's behaviour at all on other inputs that you wouldn't generate during this red teaming process.

Play episode from 01:23:53
