
#66 – Michael Cohen on Input Tampering in Advanced RL Agents
Hear This Idea
The Argument for Reward Is Not the Optimization Target
There are a couple of posts on the Alignment Forum, or somewhere like that. One is called "Reward is not the optimization target", which is by Alex Turner, and then someone wrote a follow-up called "Models don't 'get reward'". The argument makes a ton of sense when I imagine RL agents as, in some sense, wanting to get reward over as many time steps as possible. But maybe there's a sense in which that's actually just not quite accurate.
(Excerpt from 01:21:39 in the episode.)
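As a rough illustration of the distinction those posts draw (this sketch is not from the episode, and the toy environment and names below are made up): in a plain policy-gradient setup, reward enters training only as a scalar that scales the update to the policy's parameters. The trained policy is just a fixed mapping from observations to action probabilities; it never "receives" reward once training stops.

```python
# Hypothetical REINFORCE-style sketch (illustrative only). Reward appears
# solely as a coefficient on the gradient update; the deployed policy is
# the parameter vector `theta`, which contains no notion of "getting" reward.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)  # logits over two actions; the only thing training changes

def act(theta):
    """Sample an action from the softmax policy; return (action, probs)."""
    probs = np.exp(theta) / np.exp(theta).sum()
    return rng.choice(2, p=probs), probs

def reward(action):
    """Toy one-step environment: action 1 pays off more often than action 0."""
    return float(rng.random() < (0.8 if action == 1 else 0.2))

lr = 0.1
for _ in range(2000):
    a, probs = act(theta)
    r = reward(a)
    # grad of log softmax w.r.t. theta for the chosen action: one-hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    # Reward is just a scalar weight on this update, nothing more.
    theta += lr * r * grad_log_pi

# After training, the policy is a fixed action distribution; reward was part
# of the update rule, not part of the deployed artifact.
print(act(theta)[1])
```

On this picture, asking whether the agent "wants reward over as many time steps as possible" is a claim about what the learned parameters end up computing, not something guaranteed by the training setup itself, which is roughly the point at issue in the question above.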


