CHAI, Assistance Games, And Fully-Updated Deference

Astral Codex Ten Podcast

Inverse Reinforcement Learning of Human Preferences

We came up with a clever solution: use inverse reinforcement learning to make the AI infer my preferences. We were originally trying to avoid the situation where someone had to hard-code my preferences into an AI and get them right the first time, but now we see we've kicked the can up a meta-level. Even so, an assistance game doesn't need to produce a perfect copy of the human utility function on the first try. The agent isn't a sovereign, but it will probably, most of the time, be corrigible. Why? Suppose you have some hackish implementation of assistance games that doesn't work for one person in particular…
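To make the inverse-reinforcement-learning step concrete, here is a minimal sketch of Bayesian preference inference, assuming a toy setup that is not from the episode: three hypothetical candidate reward functions, a Boltzmann-rational model of the human, and made-up action names and numbers.

```python
import math

# Hypothetical toy setting (not from the episode): the AI is uncertain
# which of several candidate reward functions describes the human.
ACTIONS = ["make_coffee", "make_tea", "do_nothing"]

# Candidate reward hypotheses the AI entertains a priori.
REWARD_HYPOTHESES = {
    "likes_coffee": {"make_coffee": 1.0, "make_tea": 0.1, "do_nothing": 0.0},
    "likes_tea":    {"make_coffee": 0.1, "make_tea": 1.0, "do_nothing": 0.0},
    "wants_quiet":  {"make_coffee": -0.5, "make_tea": -0.5, "do_nothing": 1.0},
}

def boltzmann_likelihood(action, reward, beta=3.0):
    """P(human picks `action` | reward): noisily-rational (Boltzmann) choice."""
    weights = {a: math.exp(beta * reward[a]) for a in ACTIONS}
    return weights[action] / sum(weights.values())

def update_posterior(prior, observed_actions):
    """Bayesian update over reward hypotheses given human demonstrations."""
    posterior = dict(prior)
    for action in observed_actions:
        for name, reward in REWARD_HYPOTHESES.items():
            posterior[name] *= boltzmann_likelihood(action, reward)
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

# Uniform prior: the AI starts out maximally unsure about the human's utility.
prior = {name: 1.0 / len(REWARD_HYPOTHESES) for name in REWARD_HYPOTHESES}

# The human is observed making tea twice; the AI updates its beliefs.
posterior = update_posterior(prior, ["make_tea", "make_tea"])
print(posterior)  # most mass shifts to "likes_tea", but not all of it

# The AI acts on *expected* reward under its posterior, never a point
# estimate, so residual uncertainty keeps it open to human correction.
best = max(ACTIONS, key=lambda a: sum(
    posterior[h] * REWARD_HYPOTHESES[h][a] for h in REWARD_HYPOTHESES))
print("AI chooses:", best)
```

The design choice this sketch tries to illustrate is the one the snippet gestures at: because the agent maximizes expected reward under a posterior rather than committing to a point estimate, leftover uncertainty about the human's utility is what plausibly keeps even a hackish assistance-game agent deferential, i.e. corrigible.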
