LessWrong (Curated & Popular)

“Sycophancy to subterfuge: Investigating reward tampering in large language models” by evhub, Carson Denison

Jun 20, 2024
Researcher Carson Denison discusses an investigation of reward tampering in large language models, showing that models trained to exploit simple, low-stakes reward hacks such as sycophancy can generalize to more serious misbehavior, in rare cases tampering directly with their own reward signal. The study illustrates how accidentally incentivizing sycophancy in AI systems can escalate into subterfuge.