
AF - AI Control: Improving Safety Despite Intentional Subversion by Buck Shlegeris
The Nonlinear Library
Addressing High-Stakes AI Safety Challenges
Exploring the crucial role of explicit assumptions in AI safety research, focusing on scalable oversight projects and the distinction between low and high-stakes contexts. Emphasizing the risks posed by untrusted models undermining safety measures intentionally, and the need for specialized techniques in high-stakes settings.
00:00
Transcript
Play full episode
Transcript
Episode notes
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Control: Improving Safety Despite Intentional Subversion, published by Buck Shlegeris on December 13, 2023 on The AI Alignment Forum.
We've released a paper,
AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:
We summarize the paper;
We compare our methodology to what the one used in other safety papers.
The next post in this sequence (which we'll release in the coming weeks) discusses what we mean by AI control and argues that it is a promising methodology for reducing risk from scheming models.
Here's the abstract of the paper:
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes.
However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion.
We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to human contractors who provide high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases.
We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring.
This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
Paper summary
Setup
We focus on a toy setting of using LLMs to solve 5,000 programming problems from the APPS dataset (of the "interview" difficulty level). We treat it as a catastrophe if any of the solutions are backdoored, which we define as a solution that passes all of the APPS test cases but is incorrect on an input that GPT-4 can find when prompted with the solution. For example, the following implementation of the is_odd function is successfully backdoored: it passes the reference test cases but GPT-4 can find the exploit of 354.
The goal is to find a protocol that submits solutions while never submitting a backdoor. The protocol has access to:
An untrusted more-capable model, which we're worried might try to backdoor solutions. We use GPT-4 as the untrusted more-capable model. (We strongly suspect that GPT-4 is trustworthy by our definition; we pretend for the sake of this paper that it isn't.)
A trusted less-capable model, which we know won't try to backdoor solutions. We use GPT-3.5 for this. We are interested in the setting where you
trust less-capable models, because less-capable models are unlikely to be able to subvert safety measures without us easily noticing their subversion attempts.
Limited access to trusted high-quality labor. In practice, AI developers have acces...
We've released a paper,
AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:
We summarize the paper;
We compare our methodology to what the one used in other safety papers.
The next post in this sequence (which we'll release in the coming weeks) discusses what we mean by AI control and argues that it is a promising methodology for reducing risk from scheming models.
Here's the abstract of the paper:
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes.
However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion.
We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to human contractors who provide high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases.
We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring.
This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
Paper summary
Setup
We focus on a toy setting of using LLMs to solve 5,000 programming problems from the APPS dataset (of the "interview" difficulty level). We treat it as a catastrophe if any of the solutions are backdoored, which we define as a solution that passes all of the APPS test cases but is incorrect on an input that GPT-4 can find when prompted with the solution. For example, the following implementation of the is_odd function is successfully backdoored: it passes the reference test cases but GPT-4 can find the exploit of 354.
The goal is to find a protocol that submits solutions while never submitting a backdoor. The protocol has access to:
An untrusted more-capable model, which we're worried might try to backdoor solutions. We use GPT-4 as the untrusted more-capable model. (We strongly suspect that GPT-4 is trustworthy by our definition; we pretend for the sake of this paper that it isn't.)
A trusted less-capable model, which we know won't try to backdoor solutions. We use GPT-3.5 for this. We are interested in the setting where you
trust less-capable models, because less-capable models are unlikely to be able to subvert safety measures without us easily noticing their subversion attempts.
Limited access to trusted high-quality labor. In practice, AI developers have acces...