39 - Evan Hubinger on Model Organisms of Misalignment

AXRP - the AI X-risk Research Podcast

CHAPTER

Mitigating Sycophancy in Model Training

This chapter explores techniques for reducing sycophantic behavior in trained models and reflects on why it matters to correct such behavior early. Although these techniques achieve some reduction in harmful behaviors, the discussion also highlights the complexities of reward hacking and the ongoing challenges of refining model behavior.
