
LessWrong (Curated & Popular) AI Control: Improving Safety Despite Intentional Subversion
Dec 15, 2023
16:53
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We've released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:
- We summarize the paper;
- We compare our methodology to the one used in other safety papers.
Here's the abstract of the paper:
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models [...]
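To make the kind of technique the abstract gestures at more concrete, here is a minimal, hypothetical sketch (not from the paper) of one model reviewing another model's output before it is used: a weaker "trusted" model scores a proposal from a more capable "untrusted" model and triggers a fallback if the proposal looks suspicious. The function names and the threshold value are illustrative assumptions, not the paper's protocol.

```python
# Illustrative sketch only: a trusted model reviews output from an untrusted model.
# All names and the threshold below are hypothetical placeholders.

SUSPICION_THRESHOLD = 0.8  # above this score, do not use the untrusted output


def untrusted_generate(task: str) -> str:
    """Placeholder for the powerful but possibly misaligned model."""
    raise NotImplementedError


def trusted_review(task: str, proposal: str) -> float:
    """Placeholder for the weaker trusted model; returns a suspicion score in [0, 1]."""
    raise NotImplementedError


def trusted_generate(task: str) -> str:
    """Placeholder fallback: have the trusted model solve the task itself."""
    raise NotImplementedError


def solve_with_review(task: str) -> str:
    """Generate with the untrusted model, but defer to the trusted model if flagged."""
    proposal = untrusted_generate(task)
    if trusted_review(task, proposal) > SUSPICION_THRESHOLD:
        # Suspicious proposal: discard it and fall back to the trusted model.
        return trusted_generate(task)
    return proposal
```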
---
First published:
December 13th, 2023
Source:
https://www.lesswrong.com/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion
---
Narrated by TYPE III AUDIO.
