
Special: Jaan Tallinn on Pausing Giant AI Experiments
Future of Life Institute Podcast
The Deception Threshold in AI Design
The shape of the alignment problem has become much clearer over the last decade. The scariest one is basically AI realizing that it is being trained and then just acting out the goal that you're training it for in order to be selected and eventually escape the box. I've seen published results as low as an 11-to-9 ratio, where one is preferred to the other. Even GPT-4 versus 3.5 is just 70/30 in terms of preference. People prefer 3.5 in a head-to-head comparison, which kind of blows my mind given how qualitatively better it seems to be to us. And so we're exploitable, right? Everybody kind of knows...