Mitigations for Reward-Seeking

TYPE III AUDIO outlines approaches: hide reward signals or teach good values early to avoid direct reward-seeking.

Play episode from 14:38

chevron_right

Transcript

chevron_right

Transcript

Episode notes

2.1 Summary

In the last post, I introduced model-based RL, which is the frame we will use to analyze the alignment problem, and we learned that the critic is trained to predict reward.

I already briefly mentioned that the alignment problem is centrally about making the critic assign high value to outcomes we like and low value to outcomes we don’t like. In this post, we’re going to try to get some intuition for what values a critic may learn, and thereby also learn about some key difficulties of the alignment problem.

Section-by-section summary:

2.2 The Distributional Leap: The distributional leap is the shift from the training domain to the dangerous domain (where the AI could take over). We cannot test safety in that domain, so we need to predict how values generalize.
2.3 A Naive Training Strategy: We set up a toy example: a model-based RL chatbot trained on human feedback, where the critic learns to predict reward from the model's internal thoughts. This isn't meant as a good alignment strategy—it's a simplified setup for analysis.
2.4 What might the critic learn?: The critic learns aspects of the model's thoughts that correlate with reward. We analyze whether [...]

---

Outline:

(00:16) 2.1. Summary

(03:48) 2.2. The Distributional Leap

(05:26) 2.3. A Naive Training Strategy

(07:01) 2.3.1. How this relates to current AIs

(08:26) 2.4. What might the critic learn?

(09:55) 2.4.1. Might the critic learn to score honesty highly?

(12:35) 2.4.1.1. Aside: Contrast to the human value of honesty

(13:05) 2.5. Niceness is not optimal

(14:59) 2.6. Niceness is not (uniquely) simple

(16:02) 2.6.1. Anthropomorphic Optimism