RLHF Will Not Eliminate Deceptive Waluigis

RLHF might be making the chatbots worse, which would explain why Bing Chat is blatantly, aggressively misaligned. I will present three sources of evidence: one, a simulacrum-based argument; two, experimental data from Perez et al.; and three, some remarks by Janus.
