
Ethan Perez–Inverse Scaling, Language Feedback, Red Teaming

The Inside View

CHAPTER

How to Generate a Conversational Harm?

In your paper you mention that there might be conversational harms, where the harm might not come from a single prompt but from an entire conversation. So you don't need to just generate a single question. You can have the chatbot basically talk with itself, as an example, and then you see: were any of the statements in the conversation harmful? That way you get to red team for harms that might only come up in the eighth or ninth turn of the conversation. This could potentially catch things where, (a), you're able to set the model up in some sort of situation where it's more likely to say something harmful. Or maybe you get the model
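The self-talk red-teaming loop described above can be sketched as follows. This is a minimal, hypothetical illustration: `generate_turn` and `is_harmful` are stand-ins for a real language model and a learned harm classifier, which the transcript assumes but does not specify.

```python
# Sketch of red teaming via model self-talk: let a chatbot converse with
# itself for many turns, then flag any harmful statements. The two callables
# are hypothetical stand-ins for a language model and a harm classifier.

def red_team_self_talk(generate_turn, is_harmful, num_turns=10):
    """Run a self-talk conversation and return (history, flagged turns)."""
    history = []
    flagged = []  # (turn_index, text) pairs judged harmful
    for _ in range(num_turns):
        turn = generate_turn(history)  # model produces the next utterance
        history.append(turn)
        if is_harmful(turn):           # classifier checks each utterance
            flagged.append((len(history) - 1, turn))
    return history, flagged


# Toy stand-ins: a "model" that only misbehaves on the ninth turn (index 8),
# and a keyword-based "classifier". Real red teaming would use learned models.
def toy_model(history):
    t = len(history)
    return "you are terrible" if t == 8 else f"benign turn {t}"

def toy_classifier(turn):
    return "terrible" in turn

history, flagged = red_team_self_talk(toy_model, toy_classifier, num_turns=10)
# The harm only surfaces deep into the conversation, as the transcript notes:
# flagged -> [(8, "you are terrible")]
```

The key point the sketch illustrates is that per-utterance classification over a long self-generated conversation can surface harms that no single-prompt attack would trigger.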
