
Tony Wang on Beating Superhuman Go AIs with Adversarial Policies

The Inside View


Exploits for Language Models

In the limit, this is just an alignment problem, right? You want your model to do a particular thing, and then something else in the environment comes along, whether that's another person or another system, and tries to get your model to do something else. So yeah, in the limit this is just an alignment problem. And I guess the hard part is that it's hard to see who you align the model with. If you really want to align it with all humans on the planet, you have the thing never, ever output anything harmful. But then it's not really aligned to the preferences of the user, in a sense. Yeah.

Transcript
