How to Jailbreak a Waluigi Simulacrum

RLHF promotes mode collapse. Recall that the Waluigi simulacra are a particular class of attractors. There is some preliminary evidence from Janus that RLHF increases the per-token likelihood that the LLM falls into an attractor state. Twitter is full of successful attempts to "jailbreak" ChatGPT and Microsoft Sydney. The user will type a response into the chatbot, and the chatbot will respond in a way that violates the rules that OpenAI sought to impose.
