
Reward Mismatches in RL Cause Emergent Misalignment
Don't Worry About the Vase Podcast
Mitigations: System Prompt Addenda
Zvi reviews prompt variants used during RL and their differing impacts on misalignment.
Transcript


