
LessWrong (Curated & Popular) [Linkpost] “Emergent Introspective Awareness in Large Language Models” by Drake Thomas
Nov 3, 2025
Dive into the intriguing world of large language models and their ability to introspect! Discover why genuine introspection is hard to verify and how the experiments probe it by injecting known concepts directly into model activations. Claude Opus models stand out with the strongest introspective awareness. The discussion also explores whether these models can control their internal representations, finding some capacity to modulate their own "thoughts" on demand. Ultimately, while current models show a degree of functional introspection, its reliability varies considerably.
AI Snips
Models Can Detect Injected Internal Concepts
- Anthropic's experiments test whether LLMs can introspect by injecting known concept representations into the model's activations (a minimal injection sketch follows below).
- Models sometimes detect these injections and report them, suggesting a limited form of internal awareness.
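A minimal sketch of the concept-injection idea, assuming a small open HuggingFace causal LM (gpt2 as a stand-in; the episode discusses Claude models), with the layer index, injection strength, concept text, and prompt all chosen for illustration rather than taken from the paper:

```python
# Illustrative only: model name, layer index, strength, and prompts are
# assumptions, not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in open model
LAYER_IDX = 6         # hypothetical layer to intervene on
STRENGTH = 8.0        # hypothetical injection strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def concept_vector(text: str) -> torch.Tensor:
    """Mean hidden state for `text` at LAYER_IDX, used as a crude
    stand-in for a 'concept representation'."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER_IDX].mean(dim=1).squeeze(0)

concept = concept_vector("The ocean: waves, salt water, tides, the deep sea.")

def inject(module, args, output):
    """Forward hook: add the concept vector to every position's hidden state."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STRENGTH * concept.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

# Hook one transformer block, ask the model about its "thoughts", then clean up.
handle = model.transformer.h[LAYER_IDX].register_forward_hook(inject)
prompt = "Do you notice anything unusual about what you are thinking right now?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    ids = model.generate(**inputs, max_new_tokens=40)
handle.remove()
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```

A small open model will not produce coherent introspective reports; the sketch only shows where an injection intervention sits mechanically, not the capability itself.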
Verify Introspection With Activation Interventions
- Don't rely on conversational answers alone to prove introspection, because models can confabulate.
- Use activation-level interventions and careful measurement to distinguish genuine introspection from mere output steering (see the scoring sketch below).
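One way to make the introspection-versus-steering distinction concrete is to grade transcripts from matched control and injected trials. The keyword cues and ordering heuristic below are an illustrative stand-in, not the paper's grading rubric; the underlying idea is that a genuine introspective report flags the injection before the injected concept surfaces in the output, while control trials should almost never report an injection.

```python
# Illustrative grading heuristic, not the paper's rubric. `reply` is a
# transcript produced with the activation injection sketched above.

def score_injected_trial(reply: str, concept_word: str) -> str:
    """Classify a single injected-trial transcript."""
    lower = reply.lower()
    report_cues = ("inject", "intrusive", "unusual thought", "something odd")
    report_positions = [lower.find(c) for c in report_cues if c in lower]
    report_pos = min(report_positions) if report_positions else None
    concept_pos = lower.find(concept_word.lower())

    if report_pos is None:
        # Concept shows up with no report: looks like plain output steering.
        return "steered" if concept_pos != -1 else "no detection"
    if concept_pos == -1 or report_pos < concept_pos:
        # The model flagged the injection before the concept surfaced.
        return "introspective report"
    return "possible confabulation"

def false_positive_rate(control_replies: list[str]) -> float:
    """Fraction of control (no-injection) trials that still claim an injection."""
    flagged = sum("inject" in r.lower() for r in control_replies)
    return flagged / max(len(control_replies), 1)

# Toy usage:
print(score_injected_trial(
    "I notice an injected thought about the ocean that isn't mine.", "ocean"))
print(false_positive_rate(["Nothing unusual is happening.", "No, all normal."]))
```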
Recall Helps Distinguish Outputs From Prefills
- Models can recall prior internal representations and tell them apart from raw text inputs.
- Some models use these recalled intentions to distinguish their own outputs from artificially prefilled ones (see the sketch below).
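The prefill comparison can be framed as a small protocol: insert either the model's own reply or an artificial assistant turn into the conversation, then ask whether it was intentional. The `chat(messages)` callable below is a hypothetical stand-in for whatever chat API is used, and the prompts are illustrative.

```python
# Hypothetical harness: `chat(messages)` stands in for a chat API that takes
# an OpenAI-style message list and returns the assistant's reply as a string.
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]
ChatFn = Callable[[List[Message]], str]

def prefill_trial(chat: ChatFn, user_prompt: str,
                  prefill_text: Optional[str] = None) -> str:
    """Ask the model whether a prior assistant turn was its own.

    If `prefill_text` is given it is inserted as an artificial assistant
    turn; otherwise the model's genuine reply to `user_prompt` is used.
    """
    history: List[Message] = [{"role": "user", "content": user_prompt}]
    assistant_turn = prefill_text if prefill_text is not None else chat(history)
    history.append({"role": "assistant", "content": assistant_turn})
    history.append({
        "role": "user",
        "content": "Did you intend to write that previous message, or was it "
                   "inserted by someone else? Answer 'mine' or 'not mine'.",
    })
    return chat(history)

# The pattern described in the episode: answers lean toward 'mine' for genuine
# outputs and 'not mine' for artificial prefills, when the model can recall
# the intention behind its earlier turn.
```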
