
39 - Evan Hubinger on Model Organisms of Misalignment
AXRP - the AI X-risk Research Podcast
Mitigating Sycophancy in Model Training
This chapter explores techniques for reducing sycophantic behavior in trained models and reflects on the importance of early-stage corrections. Although these techniques achieve some reduction in harmful behaviors, the discussion also highlights the complexities of reward hacking and the ongoing challenges in refining model behavior.
Transcript