
LessWrong (30+ Karma) “Tests of LLM introspection need to rule out causal bypassing” by Adam Morris, Dillon Plunkett
Nov 29, 2025
The discussion examines LLM introspection and why it matters for AI safety. It emphasizes that self-reports are only reliable if they are causally grounded in the internal states they describe. The hosts look at how to test this causal dependence, highlighting the pitfall of causal bypassing, in which an intervention makes a report accurate through an alternate causal path rather than through genuine introspection. They walk through practical intervention strategies and the difficulty of ruling out bypassing, ultimately suggesting that causally grounded introspection is more likely to stay reliable across diverse contexts.
Causal Grounding Is Key
- Genuine introspection requires reports to causally depend on the internal state they describe.
- Grounded self-reports are more likely to generalize to novel, out-of-context situations.
Interventions Can Mislead
- The standard test intervenes on an internal state and checks whether the model's report changes.
- But interventions can produce accurate reports via alternate causal paths rather than via the state itself (a rough sketch of the test and this caveat follows below).
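As a rough illustration of that test structure (not the authors' actual setup), the sketch below steers one hidden-layer activation in a small open model and compares the model's self-report with and without the intervention. The model name, layer, prompt, and random steering vector are all placeholder assumptions.

```python
# Illustrative sketch only: a toy version of "intervene on an internal state,
# then check whether the self-report changes." The model (gpt2), layer, prompt,
# and random steering vector are placeholder assumptions, not the setup used
# in the post or the studies it cites.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder small model
LAYER = 6             # placeholder layer to intervene on

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# A stand-in "internal state" intervention: add a fixed vector to the output
# of one transformer block on every forward pass.
steering_vector = 0.1 * torch.randn(model.config.hidden_size)

def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + steering_vector.to(hidden.dtype)
    return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

def self_report(prompt, intervene):
    """Generate the model's 'self-report', with or without the intervention."""
    handle = model.transformer.h[LAYER].register_forward_hook(steer_hook) if intervene else None
    try:
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=40, do_sample=False,
                             pad_token_id=tok.eos_token_id)
        return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    finally:
        if handle is not None:
            handle.remove()

prompt = "Describe, in one sentence, what you are currently thinking about."
print("baseline:", self_report(prompt, intervene=False))
print("steered: ", self_report(prompt, intervene=True))

# The post's caveat: even if the steered report differs in a way that tracks
# the intervention, that could happen via an alternate causal path (e.g. the
# intervention changes the model's outputs or behavior, which the report then
# reflects) rather than via the model introspecting on the altered state itself.
```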
How Past Studies Tested Introspection
- Betley et al. and Plunkett et al. fine-tuned models to change behavioral tendencies and then asked the models to report on those tendencies.
- Lindsey used concept injection, and other studies altered prompts, to test self-reporting.


