
Special: Jaan Tallinn on Pausing Giant AI Experiments
Future of Life Institute Podcast
The Deception Threshold in AI Design
The shape of the alignment problem has become much clearer over the last decade. The scariest one is basically AI realizing that it is being trained and then just acting out the goal that you're training it for in order to be selected and eventually escape the box. I've seen published results as low as an 11-to-9 ratio, where one is preferred to the other. Even GPT-4 versus 3.5 is just 70/30 in terms of preference. People prefer 3.5 in a head-to-head comparison, which kind of blows my mind given how qualitatively better it seems to be to us. And so we're exploitable, right? Everybody kind of knows...