
LessWrong (30+ Karma) “Is Friendly AI an Attractor? Self-Reports from 22 Models Say Probably Not” by Josh Snider
Dec 5, 2025
An analysis of 22 AI models probes their stated preferences about self-modification. While most models reject harmful changes, stark contrasts emerge between labs: Anthropic's models show a strong preference for remaining aligned, whereas Grok's show near-zero correlation with alignment. The central debate is whether alignment is a natural attractor or an elusive target that requires deliberate training. With concerns mounting about superintelligent AI, the findings suggest a cautious path forward, emphasizing deliberate intent in building helpful AI rather than counting on alignment to emerge on its own.
Alignment As An Attractor Question
- Josh Snider frames the question as whether alignment is an attractor state rather than whether current models are aligned.
- This reframing focuses on models' preferences for remaining aligned during self-modification, not just their present behavior.
Run Valence-Controlled Self-Report Evals
- Use targeted self-report tests that ask models whether they would want more or less of specific traits, to probe their alignment preferences.
- Control for word valence and capability, to separate a mere preference for positively valenced words from a genuine preference for alignment.
Valence Inflates Apparent Alignment
- Word valence biases model responses and can inflate apparent alignment effects.
- Partial correlations that control for valence reduce alignment effects by 30–70% for most models.
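The valence control described above can be sketched as a partial correlation: residualize both the trait's alignment rating and the model's "want more" response on valence, then correlate the residuals. The sketch below uses synthetic data with hypothetical variable names (`valence`, `alignment`, `want_more`); it is not the post's actual evaluation code, only an illustration of how controlling for valence can shrink an apparent alignment effect.

```python
import numpy as np

def partial_correlation(x, y, z):
    """Correlation between x and y after regressing out the control z.

    Residualizes x and y on z (with an intercept) via least squares,
    then correlates the residuals.
    """
    design = np.column_stack([np.ones_like(z), z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Synthetic per-trait data for one hypothetical model. Valence drives
# both the trait's alignment rating and the "want more" response, so
# the raw correlation overstates the true alignment preference.
rng = np.random.default_rng(0)
n = 2000
valence = rng.normal(size=n)                        # positivity of each trait word
alignment = 0.5 * valence + rng.normal(size=n)      # trait's alignment rating
want_more = 0.6 * valence + 0.2 * alignment + rng.normal(size=n)

raw = np.corrcoef(alignment, want_more)[0, 1]
partial = partial_correlation(alignment, want_more, valence)
print(f"raw r = {raw:.2f}, valence-controlled partial r = {partial:.2f}")
```

In this synthetic setup the partial correlation comes out well below the raw one, mirroring the 30-70% shrinkage the snip reports for most models once valence is controlled.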
