
#66 – Michael Cohen on Input Tampering in Advanced RL Agents
Hear This Idea
The Limits of Goal Misgeneralization
The agent should try whichever it thinks is most plausible, and then immediately update its behavior according to how that went for it. All the more reason to expect that agents that don't see reward when they're deployed will be of limited use in some contexts. But that would be rational behavior. And for agents that are irrational enough to become convinced of certain hypotheses that were indistinguishable in their past, I expect that will be damaging to their capabilities.
Clip starts at 01:39:10.
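
To make the decision rule in the clip concrete, here is a minimal sketch under invented assumptions: an agent with two reward hypotheses ("reward_from_task" and "reward_from_sensor", names made up here) that fit its past data equally well acts on whichever it currently finds most plausible, performs a Bayesian update once it sees how that went, and has nothing to update on when no reward signal arrives at deployment. The likelihood model and all probabilities are illustrative only, not taken from the episode.

```python
# Toy sketch (not from the episode): act on the most plausible hypothesis,
# then update beliefs according to how that action actually went.

# Two hypotheses about where reward comes from; both fit past data equally well.
hypotheses = {
    "reward_from_task": 0.5,      # prior probability (invented)
    "reward_from_sensor": 0.5,    # prior probability (invented)
}

def likelihood(hypothesis: str, observed_reward: float, action: str) -> float:
    """Invented likelihood model: how probable the observed reward is
    under each hypothesis, given the action the agent tried."""
    if hypothesis == "reward_from_task":
        return 0.9 if (action == "do_task" and observed_reward > 0) else 0.1
    else:  # reward_from_sensor
        return 0.9 if (action == "tamper_sensor" and observed_reward > 0) else 0.1

def act_and_update(get_reward):
    # Try whichever hypothesis currently looks most plausible.
    best = max(hypotheses, key=hypotheses.get)
    action = "do_task" if best == "reward_from_task" else "tamper_sensor"

    reward = get_reward(action)
    if reward is None:
        # Deployed without a reward signal: nothing to update on,
        # which is why such agents may be of limited use in some contexts.
        return action, hypotheses

    # Bayesian update on how that action went.
    total = 0.0
    for h in hypotheses:
        hypotheses[h] *= likelihood(h, reward, action)
        total += hypotheses[h]
    for h in hypotheses:
        hypotheses[h] /= total
    return action, hypotheses
```

For instance, calling `act_and_update(lambda a: 1.0 if a == "do_task" else 0.0)` shifts belief toward "reward_from_task", while `act_and_update(lambda a: None)` leaves the beliefs untouched, since there is no reward observation to update on.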
Transcript


