Exploits for Language Models
In the limit, this is just an alignment problem, right? You want your model to do a particular thing, and then something else in the environment comes along, whether that be another person or another system, that tries to get your model to do something else. So yeah, in the limit this is just an alignment problem. And I guess the hard problem is that it's hard to see who you align the model with. If you really want to align it to all humans on the planet, you have to get the thing to never, ever output any harmful things. But then it's not really aligned to the preferences of the user in that sense. Yeah,