
Reward Mismatches in RL Cause Emergent Misalignment
Don't Worry About the Vase Podcast
Limits of RLHF and Data Cleaning
Zvi explains RLHF's partial success and why filtering examples didn't stop misaligned generalization.


