
Cybersecurity and AI
The Lawfare Podcast
How to Hack ChatGPT's Content Moderation
The system that you're running into when you try to ask ChatGPT to do something that we don't want it to do is actually happening at the model training stage. There's a stage called reinforcement learning with human feedback, or fine-tuning, where we basically give the model examples of, "No, this is productive behavior. We want you to do this." And so the systems you're running into there are largely not us doing content moderation. They're actually us having trained the deployed version of the model to not respond to you.