
#66 – Michael Cohen on Input Tampering in Advanced RL Agents
Hear This Idea
The Limits of Goal Misgeneralization
The agent should try whichever it thinks is most plausible, and then immediately update its behavior according to how that went for it. All the more reason to expect that agents that don't see reward when they're deployed will be of limited use in some contexts. But that would be rational behavior. And for agents that are irrational enough to become convinced of certain hypotheses that were indistinguishable in their past, I expect that will be damaging to their capabilities.
Clip starts at 01:39:10.
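
To make the decision rule in the clip concrete, here is a minimal sketch under invented assumptions: an agent with two reward hypotheses ("reward_from_task" and "reward_from_sensor", names made up here) that fit its past data equally well acts on whichever it currently finds most plausible, performs a Bayesian update once it sees how that went, and has nothing to update on when no reward signal arrives at deployment. The likelihood model and all probabilities are illustrative only, not taken from the episode.

```python
# Toy sketch (not from the episode): act on the most plausible hypothesis,
# then update beliefs according to how that action actually went.

# Two hypotheses about where reward comes from; both fit past data equally well.
hypotheses = {
    "reward_from_task": 0.5,      # prior probability (invented)
    "reward_from_sensor": 0.5,    # prior probability (invented)
}

def likelihood(hypothesis: str, observed_reward: float, action: str) -> float:
    """Invented likelihood model: how probable the observed reward is
    under each hypothesis, given the action the agent tried."""
    if hypothesis == "reward_from_task":
        return 0.9 if (action == "do_task" and observed_reward > 0) else 0.1
    else:  # reward_from_sensor
        return 0.9 if (action == "tamper_sensor" and observed_reward > 0) else 0.1

def act_and_update(get_reward):
    # Try whichever hypothesis currently looks most plausible.
    best = max(hypotheses, key=hypotheses.get)
    action = "do_task" if best == "reward_from_task" else "tamper_sensor"

    reward = get_reward(action)
    if reward is None:
        # Deployed without a reward signal: nothing to update on,
        # which is why such agents may be of limited use in some contexts.
        return action, hypotheses

    # Bayesian update on how that action went.
    total = 0.0
    for h in hypotheses:
        hypotheses[h] *= likelihood(h, reward, action)
        total += hypotheses[h]
    for h in hypotheses:
        hypotheses[h] /= total
    return action, hypotheses
```

For instance, calling `act_and_update(lambda a: 1.0 if a == "do_task" else 0.0)` shifts belief toward "reward_from_task", while `act_and_update(lambda a: None)` leaves the beliefs untouched, since there is no reward observation to update on.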
Transcript


