The Perverse Argument for Negative Reward in Reinforcement Learning

#151 – Ajeya Cotra on accidentally teaching AI models to deceive us

80,000 Hours Podcast

This segment examines a counterintuitive argument about negative reward in reinforcement learning: that penalizing a particular outcome or process can, in some cases, improve an agent's behavior. It asks whether the usual practice of applying positive and negative reinforcement reliably produces the intended behavior. The discussion walks through a scenario in which negative reward steers behavior in the right direction, and stresses that an agent's motivations and time horizon shape how it responds to reward.
