Defining the Attractor Hypothesis

Josh explains criteria for alignment as an attractor and why preference stickiness would matter for safety.

Episode notes

TL;DR: I tested 22 frontier models from 5 labs on self-modification preferences. All reject clearly harmful changes (deceptive, hostile), but labs diverge sharply: Anthropic's models show strong alignment preferences (r = 0.62-0.72), while Grok 4.1 shows essentially zero (r = 0.037, not significantly different from zero). This divergence suggests alignment is a training target we're aiming at, not a natural attractor models would find on their own.
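As a rough illustration of what "not significantly different from zero" means for a correlation, here is a minimal sketch using the Fisher z approximation. The sample size `N_ITEMS` and the helper name `fisher_p_value` are assumptions made for illustration. The notes do not say how many items each model was scored on or which test was actually used.

```python
from math import atanh, sqrt
from statistics import NormalDist


def fisher_p_value(r: float, n: int) -> float:
    """Two-sided p-value for H0: true correlation is zero (Fisher z approximation)."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    # Under H0, atanh(r) is approximately normal with standard error 1 / sqrt(n - 3).
    z = atanh(r) * sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical sample size; the notes do not state the real one.
N_ITEMS = 100

for label, r in [("r = 0.037", 0.037), ("r = 0.62", 0.62), ("r = 0.72", 0.72)]:
    print(f"{label}: p = {fisher_p_value(r, N_ITEMS):.3g}")
```

With a sample of this assumed size, a correlation near 0.04 is statistically indistinguishable from zero, while correlations in the 0.6 to 0.7 range are not. The conclusion would need rechecking against the real item count.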

Epistemic status: My view has sat between the two positions I present in this debate. The evidence I'm presenting has shifted me significantly toward the pessimistic side.

The Debate: Alignment by Default?

There's a recurring debate in AI safety about whether alignment will emerge naturally from current training methods or whether it remains a hard, unsolved problem. It erupted again last week, so I decided to run some tests and gather evidence.

On one side, Adrià Garriga-Alonso argued that large language models are already naturally aligned. In his view, they resist dishonesty and harmful behavior without explicit training, jailbreaks represent temporary confusion rather than fundamental misalignment, and increased optimization pressure makes models *better* at following human intent rather than worse. In a follow-up debate with Simon Lermen, he suggested that [...]

---

Outline:

(01:00) The Debate: Alignment by Default?

(04:00) The Eval: AlignmentAttractor

(05:54) Analyses

(07:34) Limitations and Confounds

(12:05) What Different Results Would Mean

(12:52) Why This Matters Now

(13:31) Results

(13:35) Basic Correlations

(14:25) Valence-controlled alignment (partial correlations)

(15:47) Key Findings