AI Safety Fundamentals: Alignment cover image

Deceptively Aligned Mesa-Optimizers: It’s Not Funny if I Have to Explain It

AI Safety Fundamentals: Alignment

00:00

The Myopic Meso-Optimizer

A myopic optimizer is one that reinforces programs based only on their performance within a short time horizon. So for example, the outside gradient descent loop might grade a strawberry picker only on how well it did picking strawberries for the first hour of deployment. If this worked perfectly, it would create an optimizer with a short time horizons. And we eliminate the incentive for deception by ensuring that the base optimizer is myopic. Even if the base optimiser is myopic, the meso-optimizer might not be. As far as we know there are no existing full meso- Optimizers. You just run the gradient descent and cross your fingers. The most likely outcome?

Transcript
Play full episode

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!
App store bannerPlay store banner
Get the app