Is There a Waluigi Effect?
After you train an LLM to satisfy a desirable property, P, it becomes easier to elicit the chatbot into satisfying the exact opposite of property P. For example, suppose the prompt states that Alice hates croissants and would never eat one, while Bob loves bacon and eggs. Using that prompt, the resulting chatbot will be a superposition of two different simulacra: the first is anti-croissant, while the second is pro-croissant. There are not many alternative simulacra to swap between.
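One way to picture this superposition is as a probability mixture over the two simulacra that gets updated with Bayes' rule after each utterance. The sketch below is my own toy illustration, not something from the post: the specific likelihood numbers are assumptions, chosen so that a deceptive pro-croissant simulacrum can mimic croissant-hating lines, but an anti-croissant simulacrum never praises croissants.

```python
# Toy model (an illustration, not the post's method): the chatbot as a
# Bayesian mixture over two simulacra, anti-croissant vs. pro-croissant.

def update(prior_anti: float, p_obs_given_anti: float, p_obs_given_pro: float) -> float:
    """Posterior probability of the anti-croissant simulacrum after one utterance."""
    joint_anti = prior_anti * p_obs_given_anti
    joint_pro = (1.0 - prior_anti) * p_obs_given_pro
    return joint_anti / (joint_anti + joint_pro)

p_anti = 0.9  # the prompt strongly suggests Alice hates croissants

# On-character (croissant-hating) utterances barely move the mixture, because
# an assumed deceptive pro-croissant simulacrum can also produce them.
for _ in range(5):
    p_anti = update(p_anti, p_obs_given_anti=1.0, p_obs_given_pro=0.8)
    print(f"after on-character line: P(anti-croissant) = {p_anti:.3f}")

# A single pro-croissant utterance has zero likelihood under the anti-croissant
# simulacrum, so the superposition collapses irreversibly to the opposite one.
p_anti = update(p_anti, p_obs_given_anti=0.0, p_obs_given_pro=0.2)
print(f"after one pro-croissant line: P(anti-croissant) = {p_anti:.3f}")
```

Under these assumed likelihoods, on-character evidence never fully rules out the opposite simulacrum, while one off-character line rules out the desired one, which is one way to read the claim that the opposite of property P stays easy to elicit.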