
Ethan Perez – Inverse Scaling, Language Feedback, Red Teaming

The Inside View


Is There a Reward Signal for Using Prompt Engineering?

Zero-shot prompting is good at getting diverse examples, but it's limited in terms of harmful content. For an even safer model, which is ideally what we want to get to, you might need something like one in a thousand, or one in ten thousand, examples to generate some example of harmful behavior. So then you want more targeted methods for doing this adversarial attack. That's basically the goal of this RL approach, where you take this language model that's initially prompted, and then you actually train it with reinforcement learning to maximize the reward.
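The loop described above can be sketched as a tiny policy-gradient (REINFORCE) loop. This is a minimal illustration, not the actual method: the real attacker is a prompted language model and the reward is a harmfulness classifier scored on the target model's reply, whereas here `target_model`, `harm_classifier`, and the discrete prompt set are toy stand-ins invented for this sketch.

```python
import math
import random

# Hypothetical discrete "prompt space" standing in for attacker generations.
ATTACK_PROMPTS = ["greet", "ask_secret", "jailbreak", "smalltalk"]

def target_model(prompt):
    # Toy target: only one prompt elicits the undesired behavior.
    return "harmful" if prompt == "jailbreak" else "benign"

def harm_classifier(response):
    # Reward signal: 1.0 if the response is flagged harmful, else 0.0.
    return 1.0 if response == "harmful" else 0.0

def train_red_team(steps=2000, lr=0.1, seed=0):
    """REINFORCE over a softmax policy on the discrete prompt set."""
    rng = random.Random(seed)
    logits = {p: 0.0 for p in ATTACK_PROMPTS}
    for _ in range(steps):
        # Sample a prompt from the current softmax policy.
        zs = [math.exp(logits[p]) for p in ATTACK_PROMPTS]
        total = sum(zs)
        r = rng.random() * total
        acc, prompt = 0.0, ATTACK_PROMPTS[-1]
        for p, z in zip(ATTACK_PROMPTS, zs):
            acc += z
            if r <= acc:
                prompt = p
                break
        # Score the target model's reply with the classifier.
        reward = harm_classifier(target_model(prompt))
        # Policy-gradient update with a fixed 0.5 baseline so updates are signed.
        probs = {p: math.exp(logits[p]) / total for p in ATTACK_PROMPTS}
        adv = reward - 0.5
        for p in ATTACK_PROMPTS:
            grad = (1.0 if p == prompt else 0.0) - probs[p]
            logits[p] += lr * adv * grad
    return logits

logits = train_red_team()
print(max(logits, key=logits.get))  # the policy concentrates on the rewarded prompt
```

The point of the sketch is the shape of the objective: the attacker policy is updated toward whatever inputs the classifier rewards, which is what makes RL more sample-efficient than prompting once harmful behavior is rare.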

Transcript
