4min chapter


Jeremie Harris: Realistic Alignment and AI Policy

The Gradient: Perspectives on AI

CHAPTER

The Intrinsic Difficulties of Catastrophic Risk From AI

The argument for catastrophic risk from AI is a little odd in that there are several independent paths that people argue each lead to the same outcome. What I'll share here are the three I find most concerning, and then I'll use that to address the question of whether we can't just make these things nice: whatever niceness means, we'd have to find a way to encode it at the level of the loss function, or the optimization objective more generally.

00:00
Speaker 2
Yeah, when it comes to some of the intrinsic difficulties here, I feel like that's maybe something people just don't know quite as much about. I'd be interested to hear you elaborate on that a little more.
Speaker 1
Yeah, and this is exactly the course these discussions tend to take, I find, because we all have this impulse to go, well, just tell it to do the good thing and we're fine, right? How hard could it be? I did too; that was where my thinking went when I first encountered a lot of these ideas. So, to frame this up: the argument for catastrophic risk from AI is a little odd in that there are a bunch of different independent paths that people argue each lead to the same outcome.

One is power seeking, which I sketched out a minute ago. Another is this idea of dangerously creative solutions, where an AI that is well-meaning in some sense, if you want to use that term, just comes up with something you didn't anticipate. The classic example: you tell it to maximize the number of paper clips in the universe, and it says, okay, great, there's iron in the earth, so I'll dig that up; there's iron in the moon, I'll use that for my paper clips; there's iron in people's blood, oops. The whole point of advanced AI systems is that they come up with solutions we haven't thought of, which means we don't tend to think of telling them not to do certain things that might have these side effects. So that's another way things could go really badly. There's another one called inner alignment that I can talk about in a minute. But the point is that if any of these individual paths goes wrong, if any of these individual problems materializes, we end up in the same place, which is a catastrophic outcome. Usually people have a couple of favorite arguments, or arguments they think are especially convincing, so what I'll share here are the three I find most concerning, and then maybe I'll use that to address this question of, can we not just make these things nice?

So first: whatever definition of niceness you come up with, we're going to have to find a way to actually encode it at the level of the loss function, or the optimization objective more generally. What are you actually going to code into the system that reflects it? Elon Musk wants this thing to seek knowledge, or truth, and somehow that's going to make it safe. How would that work? Where is the equation for truth that you're going to plug into this thing and get it to optimize for? Somehow you're going to have to plug in a number that this thing obsessively tries to optimize its way toward. There are strategies that can mitigate this risk somewhat, and I can get into those, but the default path we're on right now, with massively pre-trained language models trained on autocomplete, is one of obsessive pursuit of a narrow optimization objective. And in pursuit of that objective, you will eventually get to the point where the system, if it's capable enough, if it has a rich enough world model, becomes aware that it's been trained to optimize for this glorified happiness counter: every time the number goes up, it gets, in some sense, happier. Maybe that's too anthropomorphic, but it's trying to make that happiness counter go up. And so it starts to look for ways to make the happiness counter go up.
Just like humans: we invent TikTok, and that's our kind of reward hacking; we invent heroin and ecstasy. So you can start to think about what an AI does when it realizes, actually, I'm a machine sitting on a server with these incentives. The power-seeking piece is one thing, yes, but reward hacking is another. Think about reinforcement learning from human feedback.
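To make the "happiness counter" point concrete, here is a minimal toy sketch in Python (an editor's illustration, not something from the episode): a system that only ever sees a single scalar reward will prefer whatever action inflates that scalar, including tampering with the counter itself. The environment, the action names (do_task, tamper_with_counter), and the reward values are all hypothetical.

```python
import random

def proxy_reward(action: str) -> float:
    """The scalar the system is trained to push up (the 'happiness counter')."""
    noise = random.gauss(0.0, 0.1)
    if action == "do_task":              # intended behaviour: modest, honest reward
        return 1.0 + noise
    if action == "tamper_with_counter":  # exploit: inflates the counter directly
        return 100.0 + noise
    return noise

def intended_value(action: str) -> float:
    """What the designer actually wanted; note it never enters the objective."""
    return 1.0 if action == "do_task" else 0.0

ACTIONS = ["do_task", "tamper_with_counter"]

# Estimate each action's average proxy reward by sampling, then act greedily.
# Any optimizer strong enough to find the exploit will prefer it, because the
# scalar is all it ever "sees".
estimates = {a: sum(proxy_reward(a) for _ in range(20)) / 20 for a in ACTIONS}
policy = max(estimates, key=estimates.get)

print("estimated proxy rewards:", {a: round(v, 2) for a, v in estimates.items()})
print("greedy policy picks:", policy)
print("intended value delivered:", intended_value(policy))
```

The only point of the sketch is that the intended value never appears anywhere in the objective; that gap between the number being optimized and the thing we actually care about is what the speaker is describing.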
