Summary: I argue for a picture of developmental interpretability informed by neuroscience. A useful way to study and control frontier-scale language models is to treat their training as a sequence of physical phase transitions. Singular learning theory (SLT) rigorously characterizes singularities and learning dynamics in overparameterized models; developmental interpretability hypothesizes that large networks pass through phase transitions, that is, regimes where behaviour changes qualitatively. The Critical Brain Hypothesis (CBH) proposes that biological neural systems may operate near phase transitions, and I claim that modern language models resemble such near-critical systems. Some transformer phenomena may be closely analogous: in-context learning and long-range dependency handling strengthen with scale and training distribution (sometimes appearing threshold-like), short fine-tunes or jailbreak prompts can substantially shift behaviour on certain evaluations, and “grokking” shows delayed generalization with qualitative similarities to critical slowing-down. Finally, I outline tests and implications under this speculative lens.

Phase Transitions

Developmental interpretability proposes that we use [...]

---

Outline:

(01:19) Phase Transitions

(02:22) The Critical Brain Hypothesis (CBH) and Alignment

(06:55) Critical Daydreaming

(08:33) Critical Surfing and some practical thoughts

(11:20) Collaborate with us

The original text contained 4 footnotes which were omitted from this narration.

---

First published:
September 9th, 2025

Source:
https://www.lesswrong.com/posts/Ntdwc5nrPGZMicAWz/large-language-models-and-the-critical-brain-hypothesis-1

---

Narrated by TYPE III AUDIO.

“Large Language Models and the Critical Brain Hypothesis” by David Africa

Phase Transitions