The Problem With AI Learning to Understand Our Values

The challenge is getting these systems to share those values or just care about those things in some robust sense. It seems like thinking of this as it's surviving or not doesn't actually make sense unless it's at the very end of training. How does it come to know which weight changes are good? It has to understand the working of its own mind or something. Also, even if it can, it's not immediately obvious to me that the one where it does the thing you wanted it to do is better for it.

Play episode from 26:24

Transcript

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app