
"The Waluigi Effect (mega-post)" by Cleo Nardo
LessWrong (Curated & Popular)
How to Jailbreak a Waluigi Simulacrum
RLHF promotes mode collapse. Recall that the Waluigi simulacra are a particular class of attractors. There is some preliminary evidence from Janus that RLHF increases the per-token likelihood that the LLM falls into an attractor state. Twitter is full of successful attempts to "jailbreak" ChatGPT and Microsoft Sydney. The user will type a response into the chatbot, and the chatbot will respond in a way that violates the rules that OpenAI sought to impose.
Transcript


