
"The Waluigi Effect (mega-post)" by Cleo Nardo

LessWrong (Curated & Popular)


Asymmetry of the Kullback–Leibler Divergence

The longer you interact with the LLM, the more likely it is that the LLM will eventually collapse into a waluigi. This is formally connected to the asymmetry of the Kullback–Leibler divergence. RLHF is the method used by OpenAI to coerce GPT-3, GPT-3.5, and GPT-4 into a smart, honest, helpful, harmless assistant. If we can't naively prompt an LLM into alignment, maybe RLHF would work instead.
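The argument leans on the KL divergence being asymmetric: D_KL(P‖Q) generally differs from D_KL(Q‖P). A minimal sketch with two illustrative discrete distributions (the specific probability values are assumptions, chosen only to exhibit the asymmetry):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) for discrete distributions.

    Sums p_i * log(p_i / q_i) over outcomes with nonzero p_i.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)  # D_KL(P || Q) ~ 0.511 nats
reverse = kl_divergence(q, p)  # D_KL(Q || P) ~ 0.368 nats
print(forward, reverse)        # the two directions disagree
```

Because the two directions weight mismatches differently, swapping which distribution plays the role of "truth" changes the measured divergence.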
