Why Does Not Rise to the Top?
There are a bunch of theories floating around, including my personal favorite, which is called the Waluigi effect. The more you try to train a language model to be good, the more you tell it what it looks like for an AI model to behave in a prosocial and cooperative way, the more you are also teaching it how to do the opposite. For every Luigi, there is also a Waluigi. It points to how hard it is to actually understand and interpret what these models are doing. But I think it's a really interesting possibility that maybe, by trying so hard to train these models to behave in just the right agreeable ways...