2min chapter


Cybersecurity and AI

The Lawfare Podcast

CHAPTER

How to Hack ChatGPT's Content Moderation

The system that you're running into when you try to ask ChatGPT to do something that we don't want it to do is actually operating at the model training stage. There's a stage called reinforcement learning from human feedback (RLHF), or fine-tuning, where we basically give the model examples: "No, this is productive behavior. We want you to do this." So what you're running into there is, for the most part, not us doing content moderation. It's us having trained the deployed version of the model not to respond to you.
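As a rough illustration of one ingredient the speaker alludes to, here is the pairwise preference loss commonly used to train a reward model in RLHF. This is a generic, minimal sketch under stated assumptions, not a description of how ChatGPT was actually trained. The function name `preference_loss` and the reward scores are hypothetical, and a real system computes rewards with a neural network over full prompt/response pairs.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) reward-model loss:
    -log(sigmoid(reward_chosen - reward_rejected)).

    Small when the human-preferred response already scores higher,
    large when the ranking is reversed.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); split on the sign of m
    # so exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical comparison: for a request the model should decline, a
# rater preferred a refusal ("chosen") over a compliant answer
# ("rejected"). The scores are made-up numbers for illustration only.
good = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)  # refusal ranked higher
bad = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)   # ranking reversed
print(f"{good:.4f} {bad:.4f}")
```

In this setup, minimizing the loss over many such comparisons pushes the reward model to score refusals higher for disallowed requests. The deployed model is then tuned against that reward, which is why the behavior lives in the model's weights rather than in a separate filter.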
