

LW - What is it to solve the alignment problem? by Joe Carlsmith
Aug 27, 2024
01:31:45
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is it to solve the alignment problem?, published by Joe Carlsmith on August 27, 2024 on LessWrong.
People often talk about "solving the alignment problem." But what is it to do such a thing? I wanted to clarify my thinking about this topic, so I wrote up some notes.
In brief, I'll say that you've solved the alignment problem if you've:
1. avoided a bad form of AI takeover,
2. built the dangerous kind of superintelligent AI agents,
3. gained access to the main benefits of superintelligence, and
4. become able to elicit some significant portion of those benefits from some of the superintelligent AI agents at stake in (2).[1]
The post also discusses what it would take to do this. In particular:
I discuss various options for avoiding bad takeover, notably:
Avoiding what I call "vulnerability to alignment" conditions;
Ensuring that AIs don't try to take over;
Preventing such attempts from succeeding;
Trying to ensure that AI takeover is somehow OK. (The alignment discourse has been surprisingly interested in this one; but I think it should be viewed as an extreme last resort.)
I discuss different things people can mean by the term "corrigibility"; I suggest that the best definition is something like "does not resist shut-down/values-modification"; and I suggest that we can basically just think about incentives for/against corrigibility in the same way we think about incentives for/against other types of problematic power-seeking, like actively seeking to gain resources.
I also don't think you need corrigibility to avoid takeover; and I think avoiding takeover should be our focus.
I discuss the additional role of eliciting desired forms of task-performance, even once you've succeeded at avoiding takeover, and I modify the incentives framework I offered in a previous post to reflect the need for the AI to view desired task-performance as the best non-takeover option.
I examine the role of different types of "verification" in avoiding takeover and eliciting desired task-performance. In particular:
I distinguish between what I call "output-focused" verification and "process-focused" verification, where the former, roughly, focuses on the output whose desirability you want to verify, whereas the latter focuses on the process that produced that output.
I suggest that we can view large portions of the alignment problem as the challenge of handling shifts in how much we can rely on output-focused verification (or at least, on our current mechanisms for output-focused verification).
I discuss the notion of "epistemic bootstrapping" - i.e., building up from what we can verify, whether by process-focused or output-focused means, in order to extend our epistemic reach much further - as an approach to this challenge.[2]
I discuss the relationship between output-focused verification and the "no sandbagging on checkable tasks" hypothesis about capability elicitation.
I discuss some example options for process-focused verification.
Finally, I express skepticism that solving the alignment problem requires imbuing a superintelligent AI with intrinsic concern for our "extrapolated volition" or our "values-on-reflection." In particular, I think just getting an "honest question-answerer" (plus the ability to gate AI behavior on the answers to various questions) is probably enough, since we can ask it the sorts of questions we wanted extrapolated volition to answer.
(And it's not clear that avoiding flagrantly-bad behavior, at least, requires answering those questions anyway.)
Thanks to Carl Shulman, Lukas Finnveden, and Ryan Greenblatt for discussion.
1. Avoiding vs. handling vs. solving the problem
What is it to solve the alignment problem? I think the standard at stake can be quite hazy. And when initially reading Bostrom and Yudkowsky, I think the image that built up most prominently i...