3min chapter

The Inside View cover image

Ethan Perez–Inverse Scaling, Language Feedback, Red Teaming

The Inside View

CHAPTER

Is It Robustly Optimizing the Objectives?

In the paper we use various forms of classifiers, like an offensive language classifier to detect if the model is generating offensive text. But that classifire, it could also detect for other things such as power seeking behaviour or malicious code generation. So you just need some sort of system that catches the kind of failure. Am, what is it a kind of long term way of doing it? Because am, for now  just like have this language modelled that is spitting some offensive content or biases. Could it be possible just like throws some like aliment task? And i then needs to figure out something robustlic, how do you see tes kind of approach scaling ito

00:00

Get the Snipd
podcast app

Unlock the knowledge in podcasts with the podcast player of the future.
App store bannerPlay store banner

AI-powered
podcast player

Listen to all your favourite podcasts with AI-powered features

Discover
highlights

Listen to the best highlights from the podcasts you love and dive into the full episode

Save any
moment

Hear something you like? Tap your headphones to save it with AI-generated key takeaways

Share
& Export

Send highlights to Twitter, WhatsApp or export them to Notion, Readwise & more

AI-powered
podcast player

Listen to all your favourite podcasts with AI-powered features

Discover
highlights

Listen to the best highlights from the podcasts you love and dive into the full episode