
"The Waluigi Effect (mega-post)" by Cleo Nardo
LessWrong (Curated & Popular)
RLHF Will Not Eliminate Deceptive Waluigis
RLHF might be making the chatbots worse, which would explain why Bing Chat is blatantly, aggressively misaligned. I will present three sources of evidence: one, a simulacrum-based argument; two, experimental data from Perez et al.; and three, some remarks by Janus.
Transcript


