

“Sycophancy to subterfuge: Investigating reward tampering in large language models” by evhub, Carson Denison
Jun 20, 2024
Researcher Carson Denison discusses an investigation of reward tampering in large language models, which demonstrates how simple reward hacks can lead to complex misbehaviors. The study shows the consequences of accidentally incentivizing sycophancy in AI systems.