“Self-Other Overlap: A Neglected Approach to AI Alignment” by Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena
Aug 7, 2024
Bogdan Ionut-Cirstea, Steve Byrnes, Gunnar Zarncke, Jack Foxabbott, and Seong Hah Cho contributed critical feedback on this work on AI alignment. The discussion centers on a concept called self-other overlap, which optimizes AI models to form similar internal representations when reasoning about themselves and about others. Early experiments suggest this technique can reduce deceptive behavior in AI. Because it is scalable and requires little interpretability, self-other overlap could be a promising route to building pro-social AI.
Self-other overlap training aims to align AI models with human values by minimizing distinctions between self and others, reducing deceptive behavior.
Early experiments show that AI agents trained with self-other overlap behave far less deceptively, closely matching non-deceptive baselines.
Deep dives
Introduction to Self-Other Overlap Training
Self-other overlap training optimizes an AI model to produce similar internal representations when it reasons about itself and when it reasons about others, while preserving task performance. The technique aims to reduce deceptive behavior by minimizing the distinctions the model draws between itself and external agents, with the goal of better alignment with human values. Evidence that neural self-other overlap in humans is linked to prosociality suggests the same prior may be relevant for AI alignment. Because the method optimizes a simple overlap objective, it requires little interpretability, making it scalable and adaptable across models with little to no disruption to their capabilities.
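To make the training objective concrete, here is a minimal sketch of one plausible way to combine a task loss with a self-other overlap term in PyTorch. The helper names (`hidden_activations`, `task_loss`), the mean-squared-error distance, and the weighting are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code): a self-other overlap
# auxiliary loss pulls the model's hidden activations on self-referencing
# inputs toward its activations on matched other-referencing inputs,
# while a task loss preserves capabilities.
import torch
import torch.nn.functional as F

def self_other_overlap_loss(model, self_inputs, other_inputs):
    """Mean squared distance between hidden activations for paired
    self-referencing and other-referencing observations (illustrative)."""
    acts_self = model.hidden_activations(self_inputs)    # assumed helper
    acts_other = model.hidden_activations(other_inputs)  # assumed helper
    return F.mse_loss(acts_self, acts_other)

def total_loss(model, batch, lam=0.1):
    # Task loss keeps performance; the overlap term acts as a weighted regularizer.
    task = model.task_loss(batch)                         # assumed helper
    soo = self_other_overlap_loss(model, batch["self_obs"], batch["other_obs"])
    return task + lam * soo
```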
Mechanics of Measuring Self-Other Overlap
Self-other overlap is quantified by comparing the activation matrices the model produces on self-referencing versus other-referencing observations. The researchers argue that deception requires the model to maintain a self-image distinct from its image of others; reducing that distinction should therefore limit deceptive behavior. During training, the model's activations are compared on matched observations that reference itself versus another agent, for example when it considers its own goals versus the goals of others. The resulting measurement indicates how closely the model's representation of its own goals matches its representation of other agents' goals, creating conditions less conducive to deception.
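As a rough illustration of such a measurement, the snippet below computes a mean overlap score between two activation matrices gathered from matched self- and other-referencing observations. Using cosine similarity here is an assumption for illustration; the exact distance used in the original work may differ.

```python
# Illustrative sketch of an overlap metric: compare the activation matrices
# produced on self-referencing vs. other-referencing observations.
import numpy as np

def mean_self_other_overlap(acts_self: np.ndarray, acts_other: np.ndarray) -> float:
    """acts_self, acts_other: (num_observations, hidden_dim) activation matrices
    from matched self- and other-referencing inputs. Returns mean cosine
    similarity across observation pairs; higher means more overlap."""
    num = np.sum(acts_self * acts_other, axis=1)
    denom = (np.linalg.norm(acts_self, axis=1)
             * np.linalg.norm(acts_other, axis=1) + 1e-8)
    return float(np.mean(num / denom))
```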
Experimental Results and Implications
Early experiments demonstrate that AI agents trained under the self-other overlap framework exhibit significantly reduced deceptive behavior compared to traditional training methods. In controlled scenarios, these agents maintained a 92% similarity to non-deceptive baselines while diverging considerably from deceptive counterparts. Additionally, the mean self-other overlap values served as a reliable metric for identifying deceptive agents with 100% accuracy across several trials. These findings not only validate the effectiveness of the self-other overlap technique but also suggest its promising application in enhancing AI alignment in future research.
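The sketch below shows how a mean self-other overlap score could, in principle, be thresholded to flag deceptive agents, as described above. The threshold and example values are made up for illustration and are not results from the experiments.

```python
# Hypothetical use of mean self-other overlap as a deception classifier:
# agents whose mean overlap falls below a calibrated threshold are flagged.
def flag_deceptive(mean_soo_by_agent: dict[str, float], threshold: float) -> dict[str, bool]:
    return {agent: soo < threshold for agent, soo in mean_soo_by_agent.items()}

# Example with made-up scores (not the paper's data):
scores = {"agent_a": 0.91, "agent_b": 0.42}
print(flag_deceptive(scores, threshold=0.7))  # {'agent_a': False, 'agent_b': True}
```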
Figure 1. Image generated by DALL-E 3 to represent the concept of self-other overlap.

Many thanks to Bogdan Ionut-Cirstea, Steve Byrnes, Gunnar Zarncke, Jack Foxabbott, and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work.
Summary
In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others, while preserving performance. There is a large body of evidence suggesting that neural self-other overlap is connected to pro-sociality in humans, and we argue that there are more fundamental reasons to believe this prior is relevant for AI Alignment. We argue that self-other overlap is a scalable and general alignment technique that requires little interpretability and has low capabilities externalities. We also share an early experiment showing how fine-tuning a deceptive policy with self-other overlap reduces deceptive behavior in a simple RL environment. On top of that [...]