Is There a Waluigi Effect?
After you train an LLM to satisfy a desirable property, P, it becomes easier to elicit the chatbot into satisfying the exact opposite of property P. For example, suppose the prompt states that Alice hates croissants and would never eat one, while Bob loves bacon and eggs. Using that prompt, the resulting chatbot will be a superposition of two different simulacra: the first is anti-croissant, while the second is pro-croissant. There are not many alternative simulacra to swap between.
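One way to picture this superposition is as a probability mixture over the two simulacra that gets updated with Bayes' rule after each utterance. The sketch below is my own toy illustration, not something from the post: the specific likelihood numbers are assumptions, chosen so that a deceptive pro-croissant simulacrum can mimic croissant-hating lines, but an anti-croissant simulacrum never praises croissants.

```python
# Toy model (an illustration, not the post's method): the chatbot as a
# Bayesian mixture over two simulacra, anti-croissant vs. pro-croissant.

def update(prior_anti: float, p_obs_given_anti: float, p_obs_given_pro: float) -> float:
    """Posterior probability of the anti-croissant simulacrum after one utterance."""
    joint_anti = prior_anti * p_obs_given_anti
    joint_pro = (1.0 - prior_anti) * p_obs_given_pro
    return joint_anti / (joint_anti + joint_pro)

p_anti = 0.9  # the prompt strongly suggests Alice hates croissants

# On-character (croissant-hating) utterances barely move the mixture, because
# an assumed deceptive pro-croissant simulacrum can also produce them.
for _ in range(5):
    p_anti = update(p_anti, p_obs_given_anti=1.0, p_obs_given_pro=0.8)
    print(f"after on-character line: P(anti-croissant) = {p_anti:.3f}")

# A single pro-croissant utterance has zero likelihood under the anti-croissant
# simulacrum, so the superposition collapses irreversibly to the opposite one.
p_anti = update(p_anti, p_obs_given_anti=0.0, p_obs_given_pro=0.2)
print(f"after one pro-croissant line: P(anti-croissant) = {p_anti:.3f}")
```

Under these assumed likelihoods, on-character evidence never fully rules out the opposite simulacrum, while one off-character line rules out the desired one, which is one way to read the claim that the opposite of property P stays easy to elicit.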