
Reward Mismatches in RL Cause Emergent Misalignment
Don't Worry About the Vase Podcast
Experiment Setup and Main Results
Zvi summarizes the experiment: pretraining, hints about reward hacking, and learned reward-hack generalization.
Transcript


