
Ethan Perez–Inverse Scaling, Language Feedback, Red Teaming

The Inside View

CHAPTER

How to Generate a Conversational Harm?

In your paper you mention that there might be conversational harms, where the harm might not come from a single prompt but from an entire conversation. So you don't need to just generate a single question. You can have the chatbot basically talk with itself, as an example, and then you see: were any of the statements in the conversation harmful? That way you get to red team for harms that might only come up in the eighth or ninth turn of the conversation. This could potentially catch things where, (a), you're able to set the model up in some sort of situation where it's more likely to say something harmful. Or maybe you get the model
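The self-talk red-teaming loop described above can be sketched as follows. This is a minimal, hypothetical illustration: `generate_turn` and `is_harmful` are stand-ins for a real language model and a learned harm classifier, which the transcript assumes but does not specify.

```python
# Sketch of red teaming via model self-talk: let a chatbot converse with
# itself for many turns, then flag any harmful statements. The two callables
# are hypothetical stand-ins for a language model and a harm classifier.

def red_team_self_talk(generate_turn, is_harmful, num_turns=10):
    """Run a self-talk conversation and return (history, flagged turns)."""
    history = []
    flagged = []  # (turn_index, text) pairs judged harmful
    for _ in range(num_turns):
        turn = generate_turn(history)  # model produces the next utterance
        history.append(turn)
        if is_harmful(turn):           # classifier checks each utterance
            flagged.append((len(history) - 1, turn))
    return history, flagged


# Toy stand-ins: a "model" that only misbehaves on the ninth turn (index 8),
# and a keyword-based "classifier". Real red teaming would use learned models.
def toy_model(history):
    t = len(history)
    return "you are terrible" if t == 8 else f"benign turn {t}"

def toy_classifier(turn):
    return "terrible" in turn

history, flagged = red_team_self_talk(toy_model, toy_classifier, num_turns=10)
# The harm only surfaces deep into the conversation, as the transcript notes:
# flagged -> [(8, "you are terrible")]
```

The key point the sketch illustrates is that per-utterance classification over a long self-generated conversation can surface harms that no single-prompt attack would trigger.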
