Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The case for a negative alignment tax, published by Cameron Berg on September 18, 2024 on LessWrong.
TL;DR:
Alignment researchers have historically predicted that building safe advanced AI would necessarily incur a significant alignment tax compared to an equally capable but unaligned counterfactual AI.
We put forward a case here that this prediction looks increasingly unlikely given the current 'state of the board,' as well as some possibilities for updating alignment strategies accordingly.
Introduction
We recently found that over one hundred grant-funded alignment researchers generally disagree with statements like:
alignment research that has some probability of also advancing capabilities should not be done (~70% somewhat or strongly disagreed)
advancing AI capabilities and doing alignment research are mutually exclusive goals (~65% somewhat or strongly disagreed)
Notably, this sample also predicted that the distribution would be significantly more skewed in the 'hostile-to-capabilities' direction.
[See ground truth vs. predicted distributions for these statements.]
These results - as well as recent events and related discussions - caused us to think more about our views on the relationship between capabilities and alignment work given the 'current state of the board,'[1] which ultimately became the content of this post. Though we expect some to disagree with these takes, we have been pleasantly surprised by the positive feedback we've received from discussing these ideas in person and are excited to further stress-test them here.
Is a negative alignment tax plausible (or desirable)?
Often, capabilities and alignment are framed with reference to the alignment tax, defined as 'the extra cost [practical, developmental, research, etc.] of ensuring that an AI system is aligned, relative to the cost of building an unaligned alternative.'
The AF/LW wiki entry on alignment taxes notably includes the following claim:
The best case scenario is No Tax: This means we lose no performance by aligning the system, so there is no reason to deploy an AI that is not aligned, i.e., we might as well align it.
The worst case scenario is Max Tax: This means that we lose all performance by aligning the system, so alignment is functionally impossible.
We speculate in this post about a different best case scenario: a negative alignment tax - namely, a state of affairs where an AI system is actually rendered more competent/performant/capable by virtue of its alignment properties.
Why would this be even better than 'No Tax?' Given the clear existence of a trillion-dollar attractor state towards ever-more-powerful AI, we suspect that the most pragmatic and desirable outcome would involve humanity finding a path forward that both (1) eventually satisfies the constraints of this attractor (i.e., is in fact highly capable, gets us AGI, etc.) and (2) does not pose existential risk to humanity.
Ignoring the inevitability of (1) seems practically unrealistic as an action plan at this point - and ignoring (2) could be collectively suicidal.
Therefore, if the safety properties of such a system were also explicitly contributing to what is rendering it capable - thereby functionally causing us to navigate away from possible futures where we build systems that are capable but unsafe - then these 'negative alignment tax' properties seem more like a feature than a bug.
It is also worth noting as an empirical datapoint that virtually all frontier models' alignment properties have rendered them more rather than less capable (e.g., gpt-4 is far more useful and far more aligned than gpt-4-base), which is the opposite of what the 'alignment tax' model would have predicted.
This idea is somewhat reminiscent of differential technological development, in which Bostrom suggests "[slowing] the devel...