
LessWrong (Curated & Popular) AI Control: Improving Safety Despite Intentional Subversion
Dec 15, 2023
16:53
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We've released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:
- We summarize the paper;
- We compare our methodology to the one used in other safety papers.
Here's the abstract of the paper:
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models [...]
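To make the kind of technique the abstract gestures at more concrete, here is a minimal, hypothetical sketch (not from the paper) of one model reviewing another model's output before it is used: a weaker "trusted" model scores a proposal from a more capable "untrusted" model and triggers a fallback if the proposal looks suspicious. The function names and the threshold value are illustrative assumptions, not the paper's protocol.

```python
# Illustrative sketch only: a trusted model reviews output from an untrusted model.
# All names and the threshold below are hypothetical placeholders.

SUSPICION_THRESHOLD = 0.8  # above this score, do not use the untrusted output


def untrusted_generate(task: str) -> str:
    """Placeholder for the powerful but possibly misaligned model."""
    raise NotImplementedError


def trusted_review(task: str, proposal: str) -> float:
    """Placeholder for the weaker trusted model; returns a suspicion score in [0, 1]."""
    raise NotImplementedError


def trusted_generate(task: str) -> str:
    """Placeholder fallback: have the trusted model solve the task itself."""
    raise NotImplementedError


def solve_with_review(task: str) -> str:
    """Generate with the untrusted model, but defer to the trusted model if flagged."""
    proposal = untrusted_generate(task)
    if trusted_review(task, proposal) > SUSPICION_THRESHOLD:
        # Suspicious proposal: discard it and fall back to the trusted model.
        return trusted_generate(task)
    return proposal
```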
---
First published:
December 13th, 2023
Source:
https://www.lesswrong.com/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion
---
Narrated by TYPE III AUDIO.
