"The Waluigi Effect (mega-post)" by Cleo Nardo

LessWrong (Curated & Popular)

How to Jailbreak a Waluigi Simulacra

RLHF promotes mode collapse. Recall that the Waluigi simulacra are a particular class of attractors. There is some preliminary evidence from Janus that RLHF increases the per-token likelihood that the LLM falls into an attractor state. Twitter is full of successful attempts to "jailbreak" ChatGPT and Microsoft Sydney. The user will type a response into the chatbot, and the chatbot will respond in a way that violates the rules that OpenAI sought to impose.

