
Tony Wang on Beating Superhuman Go AIs with Adversarial Policies

The Inside View

CHAPTER

Exploits for Language Models

In the limit, this is just an alignment problem, right? You want your model to do a particular thing, and then something else in the environment comes along, whether that be another person or another system, that tries to get your model to do something else. So yeah, in the limit this is just an alignment problem. And I guess the hard part is that it's hard to see who you align the model with. If you really want to align it to all humans on the planet, you have to get the thing to never, ever output any harmful things. But then it's not really aligned to the preferences of the user, in a sense. Yeah,
