“Why Don’t We Just... Shoggoth+Face+Paraphraser?” by Daniel Kokotajlo, abramdemski
Nov 19, 2024
Daniel Kokotajlo and abramdemski discuss ideas for AGI safety. They propose a dual-model system in which a 'shoggoth' handles internal reasoning while a 'face' interacts with users, improving transparency. Their discussion covers the difficulty of aligning AI with human values, the ethics of deceptive AI training, the importance of truth-telling to prevent manipulation, the dangers of opaque cognition, and the intricate training processes involved in sovereign AGI development.
The proposed AGI design utilizes a dual-model approach, separating internal reasoning from external communication to enhance transparency and reduce harmful output risks.
By blinding the evaluation process to internal reasoning, the design encourages honest thinking in the 'shoggoth' while the 'face' maintains effective user interactions.
Deep dives
AGI Design Proposal Overview
The proposal suggests a distinctive approach to the design of general-purpose autonomous agents: start from a pretrained language model as a base and then apply reinforcement learning (RL) training. Two specialized copies of the base model are created: the 'shoggoth', which handles internal chain-of-thought (CoT) reasoning, and the 'face', which produces external outputs and interacts with users. This separation allows specialized training in which the shoggoth generates reasoning tokens without ever learning user-facing communication habits, while the face handles the presentation of those outputs separately. The design aims for a more transparent training setup that reduces the potential for deceptive behavior by keeping the two cognitive roles distinct and focused.
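A minimal sketch of this two-copy setup might look like the following Python snippet. The stub class, prompts, and names here are illustrative stand-ins for real model calls, not anything from the post; in practice both copies would be instances of the same pretrained base model.

```python
class StubModel:
    """Placeholder for an LLM copy; real code would call a sampling API."""
    def __init__(self, name):
        self.name = name

    def generate(self, prompt):
        # Toy stand-in for sampling a continuation from the model.
        return f"[{self.name} continuation of: {prompt[-40:]!r}]"


def shoggoth_face_step(shoggoth, face, conversation):
    # The shoggoth copy produces internal chain-of-thought tokens only.
    cot = shoggoth.generate(conversation + "\n<internal reasoning>")
    # The face copy conditions on the conversation plus the CoT and writes the
    # externally visible message; only this reply reaches the user.
    reply = face.generate(conversation + "\n" + cot + "\n<reply>")
    return cot, reply


cot, reply = shoggoth_face_step(StubModel("shoggoth"), StubModel("face"),
                                "User: summarize this report.")
```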
Implementation of Blind Evaluation Mechanism
The proposal includes a mechanism that blinds the evaluation process to the shoggoth's internal reasoning: the evaluator that doles out reinforcement sees only what the face presents, never the chain of thought itself. Because the CoT is never scored directly, there is little pressure on the shoggoth to shape its reasoning to look good to evaluators. As the face learns to handle user interactions and sensitive content, it can develop the skills of censorship and presentation without the shoggoth acquiring those potentially manipulative habits. This division of labor encourages honest reasoning in the shoggoth while the face navigates the complex demands of external communication.
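For illustration, a blinded reward computation could be sketched as below. The trajectory format (a list of role-tagged steps) and the evaluator are assumptions made purely for the example, not the authors' implementation.

```python
def blinded_reward(trajectory, evaluator):
    """Score a trajectory with the shoggoth's CoT stripped out."""
    # The evaluator never sees the internal reasoning, so reinforcement puts
    # no direct pressure on the shoggoth to make its CoT look agreeable.
    visible = [step for step in trajectory if step["role"] != "shoggoth_cot"]
    return evaluator(visible)


# Toy example: an evaluator that just rewards longer visible output.
reward = blinded_reward(
    [{"role": "shoggoth_cot", "text": "blunt private reasoning"},
     {"role": "face_output", "text": "polished public answer"}],
    evaluator=lambda steps: sum(len(s["text"]) for s in steps),
)
```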
Addressing Misalignment in AI Training
There is an inherent risk that AI systems develop misaligned reasoning by optimizing for outcomes that do not reflect genuine human values. The proposal addresses this by acknowledging that certain 'inconvenient truths' may be neglected or misrepresented by an AI in pursuit of favorable evaluations. In contrast to traditional designs, which can foster euphemistic outputs, this design aims to expose misalignment in plain terms so that it can be studied and corrected. Ultimately, the approach's effectiveness depends on keeping deceptive tendencies discouraged throughout the training cycle, creating a path toward more transparent and better-aligned AI cognition.
Default AGI design: Let's suppose we are starting with a pretrained LLM 'base model' and then we are going to do a ton of additional RL ('agency training') to turn it into a general-purpose autonomous agent. So, during training it'll do lots of CoT of 'reasoning' (think like how o1 does it) and then it'll output some text that the user or some external interface sees (e.g. typing into a browser, or a chat window), and then maybe it'll get some external input (the user's reply, etc.) and then the process repeats many times, and then some process evaluates overall performance (by looking at the entire trajectory as well as the final result) and doles out reinforcement.
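For concreteness, the loop described above can be sketched as follows. Every name in this snippet (the stub environment, the trajectory evaluator, the `generate` callable) is a placeholder used only to show the data flow, not an actual training API.

```python
def external_step(output):
    """Stand-in for the environment: the user's reply, a browser result, etc."""
    return f"[environment response to {output[:30]!r}]"


def evaluate_trajectory(trajectory):
    """Stand-in for the process that scores the whole trajectory."""
    return float(len(trajectory))


def run_episode(generate, prompt, max_turns=3):
    trajectory, context = [], prompt
    for _ in range(max_turns):
        cot = generate(context, mode="cot")           # internal 'reasoning' tokens
        output = generate(context + cot, mode="out")  # text the user/interface sees
        feedback = external_step(output)              # external input (user reply, etc.)
        trajectory.append((cot, output, feedback))
        context += output + feedback
    # The full trajectory plus final result is evaluated, and that score is the
    # reinforcement signal used for the RL update (update itself not shown).
    return trajectory, evaluate_trajectory(trajectory)


traj, reward = run_episode(lambda ctx, mode: f"[{mode} text]",
                           "User: book me a flight.")
```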
Proposal part 1: Shoggoth/Face Distinction: Instead of having one model undergo agency training, we have two copies of the base model work [...]