Is There a Reward Signal for Using Prompt Engineering?

The zero-shot prompting is good at getting diverse examples, but it's limited in terms of harmful content. For an even safer model, which is ideally what we want to get to, it might be only one in a thousand, or one in ten thousand, examples that produces some harmful behavior. And so then you want more targeted methods for doing this kind of adversarial attack. So that's basically the goal of this RL approach, where you take this language model that's initially prompted, and then you actually train it with reinforcement learning to maximize the reward.
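
To make the idea concrete, here is a toy sketch of the training loop described above; it is not the speaker's actual setup. It assumes a tiny "policy" that picks one of 100 candidate test cases, starts from a uniform distribution (a stand-in for the initially prompted model), and gets reward 1 only for the single candidate that triggers the failure. The function names, the reward list, and the hyperparameters are all hypothetical, and the update is plain REINFORCE without a baseline.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_red_team_policy(rewards, steps=5000, lr=1.0, seed=0):
    """Toy REINFORCE loop over a categorical policy.

    rewards[i] is the (hypothetical) reward for sampling candidate i,
    e.g. 1.0 if it elicits the failure and 0.0 otherwise.
    Returns the final sampling probabilities.
    """
    rng = random.Random(seed)
    n = len(rewards)
    logits = [0.0] * n  # uniform start: assumed stand-in for the prompted model

    for _ in range(steps):
        probs = softmax(logits)
        # Sample one candidate from the current policy.
        action = rng.choices(range(n), weights=probs)[0]
        reward = rewards[action]
        # REINFORCE: gradient of log pi(action) w.r.t. logit i
        # is (1 if i == action else 0) - probs[i].
        for i in range(n):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad

    return softmax(logits)


if __name__ == "__main__":
    # 99 candidates that find nothing, 1 candidate that triggers the failure.
    final = train_red_team_policy([0.0] * 99 + [1.0])
    print(f"probability of the failing candidate: {final[-1]:.3f}")
```

The point of the sketch is only the shift in sampling: before training, the rewarded candidate comes up about one time in a hundred, and training on the reward concentrates the policy on it. A real setup would update a language model's weights instead of a table of logits and would score outputs with a classifier instead of a fixed reward list.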
