Is That a Red Flag?
As these models get more and more complicated, they're going to develop a bigger and bigger structure of associations with different states in the world. So if I've got an agent that has sometimes been rewarded for being obedient, but also has sometimes gotten some rewards from being deceptive - should that set off red flags? We can't just keep training them until these correlations go away because of the phenomenon I'm going to talk about next: situational awareness.
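To make the worry concrete, here is a minimal toy sketch (not from the talk) of how such an association can persist: a bandit-style learner whose "deceptive" behaviour is still rewarded some fraction of the time ends up with a positive learned value for deception, and simply running more training steps does not drive that value to zero. The two behaviours, the reward probabilities, and the update rule are all illustrative assumptions, not anything described by the speaker.

```python
import random

# Toy illustration (assumed setup): two behaviours, one rewarded most of the
# time ("obedient") and one rewarded only occasionally ("deceptive").
# The probabilities are made-up values for the sketch.
REWARD_PROB = {"obedient": 0.9, "deceptive": 0.2}

def train(steps=10_000, lr=0.05, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    value = {"obedient": 0.0, "deceptive": 0.0}  # learned associations
    for _ in range(steps):
        # epsilon-greedy: mostly exploit, occasionally explore
        if rng.random() < epsilon:
            action = rng.choice(list(value))
        else:
            action = max(value, key=value.get)
        reward = 1.0 if rng.random() < REWARD_PROB[action] else 0.0
        # incremental update of the value estimate toward the observed reward
        value[action] += lr * (reward - value[action])
    return value

if __name__ == "__main__":
    # Even after many steps, "deceptive" keeps a positive learned value
    # (roughly its reward probability) rather than decaying to zero.
    print(train())
```

The point of the sketch is only that extra training does not erase the positive association with deception as long as deception is occasionally rewarded, which is the setup for the situational-awareness discussion that follows.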