
"The Waluigi Effect (mega-post)" by Cleo Nardo
LessWrong (Curated & Popular)
The Waluigi Effect
Eliezer Yudkowsky explains the Waluigi Effect: after you train an LLM to satisfy a desirable property P, it becomes easier to elicit the chatbot into satisfying the exact opposite of property P. He says that rules normally exist in contexts in which they are broken. He also argues that protagonist versus antagonist is a common trope in plots.
Transcript


