
"The Waluigi Effect (mega-post)" by Cleo Nardo
LessWrong (Curated & Popular)
Asymmetry of the Kullback-Leibler Divergence
The longer you interact with the LLM, the more likely it is to have collapsed into a waluigi. This is formally connected to the asymmetry of the Kullback-Leibler divergence. RLHF is the method used by OpenAI to coerce GPT-3, GPT-3.5, and GPT-4 into a smart, honest, helpful, harmless assistant. If we can't naively prompt an LLM into alignment, maybe RLHF would work instead.
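(For reference, the asymmetry invoked here is just the standard property that the Kullback-Leibler divergence is not symmetric in its arguments; the definition below is the textbook one, not quoted from the post.)

```latex
D_{\mathrm{KL}}(P \parallel Q) \;=\; \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)},
\qquad
D_{\mathrm{KL}}(P \parallel Q) \;\neq\; D_{\mathrm{KL}}(Q \parallel P) \ \text{in general}.
```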


