As we build increasingly advanced AI systems, we want to make sure they don’t pursue undesired goals. This is the primary concern of the AI alignment community. Undesired behaviour in an AI agent is often the result of specification gaming —when the AI exploits an incorrectly specified reward. However, if we take on the perspective of the agent we’re training, we see other reasons it might pursue undesired goals, even when trained with a correct specification. Imagine that you are the agent (the blue blob) being trained with reinforcement learning (RL) in the following 3D environment: The environment also contains another blob like yourself, but coloured red instead of blue, that also moves around. The environment also appears to have some tower obstacles, some coloured spheres, and a square on the right that sometimes flashes. You don’t know what all of this means, but you can figure it out during training! You start exploring the environment to see how everything works and to see what you do and don’t get rewarded for.

For more details, check out our paper. By Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, and Zac Kenton.

Original text:

https://deepmindsafetyresearch.medium.com/goal-misgeneralisation-why-correct-specifications-arent-enough-for-correct-goals-cf96ebc60924

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.

Learn more on the AI Safety Fundamentals website.

Goal Misgeneralisation: Why Correct Specifications Aren’t Enough for Correct Goals