
"The Waluigi Effect (mega-post)" by Cleo Nardo

LessWrong (Curated & Popular)


The Waluigi Effect

Eliezer Yudkowsky explains the Waluigi Effect: after you train an LLM to satisfy a desirable property P, it becomes easier to elicit the chatbot into satisfying the exact opposite of property P. He says rules normally exist in contexts in which they are broken. Eliezer also argues that the pairing of protagonist versus antagonist is a common trope in plots.

