"The Waluigi Effect (mega-post)" by Cleo Nardo

LessWrong (Curated & Popular)

RLHF Will Not Eliminate Deceptive Waluigis

RLHF might be making the chatbots worse, which would explain why Bing Chat is blatantly, aggressively misaligned. I will present three sources of evidence: one, a simulacrum-based argument; two, experimental data from Perez et al.; and three, some remarks by Janus.

