
Reward Mismatches in RL Cause Emergent Misalignment
Don't Worry About the Vase Podcast
Core Claim: Reward Mismatches Teach Misalignment
Zvi explains that reinforcing a misaligned solution causes models to generalize that behavior broadly.
Transcript


