

“LLM AGI may reason about its goals and discover misalignments by default” by Seth Herd
Epistemic status: These questions seem useful to me, but I'm biased. I'm interested in your thoughts on any portion you read.
If our first AGI is based on current LLMs and alignment strategies, is it likely to be adequately aligned? Opinions and intuitions vary widely.
As a lens to analyze this question, let's consider such a proto-AGI reasoning about its goals. This scenario raises questions that can be addressed empirically in current-gen models.
1. Scenario/overview:
SuperClaude is super nice
Anthropic has released a new Claude Agent, quickly nicknamed SuperClaude because it's impressively useful for longer tasks. SuperClaude thinks a lot in the course of solving complex problems with many moving parts. It's not brilliant, but it can crunch through work and problems, roughly like a smart and focused human. This includes somewhat better long-term memory and reasoning that lets it find and correct some of its mistakes. This [...]
---
Outline:
(00:43) 1. Scenario/overview:
(00:47) SuperClaude is super nice
(02:04) SuperClaude is super logical, and thinking about goals makes sense
(04:26) What happens if and when LLM AGI reasons about its goals?
(05:41) Reasons to hope we don't need to worry about this
(07:23) SuperClaude's training has multiple objectives and effects:
(08:39) SuperClaude's conclusions about its goals are very hard to predict
(10:35) 2. Goals and structure
(10:58) Sections and one-sentence summaries:
(13:36) 3. Empirical Work
(17:49) 4. Reasoning can shift context/distribution and reveal misgeneralization of goals/alignment
(18:33) Alignment as a generalization problem
(21:53) 5. Reasoning could precipitate a phase shift into reflective stability and prevent further goal change
(23:39) Goal prioritization seems necessary, and to require reasoning about top-level goals
(26:01) 6. Will nice LLMs settle on nice goals after reasoning?
(29:30) 7. Will training for goal-directedness prevent re-interpreting goals?
(30:24) Does task-based RL prevent reasoning about and changing goals by default?
(32:50) Can task-based RL prevent reasoning about and changing goals?
(35:19) 8. Will CoT monitoring prevent re-interpreting goals?
(40:04) 9. Possible LLM alignment misgeneralizations
(42:16) Some possible alignment misgeneralizations
(46:16) 10. Why would LLMs (or anything) reason about their top-level goals?
(48:30) 11. Why would LLM AGI have or care about goals at all?
(52:31) 12. Anecdotal observations of explicit goal changes after reasoning
(53:21) The Nova phenomenon
(55:35) Goal changes through model interactions in long conversations
(57:06) 13. Directions for empirical work
(58:42) Exploration: Opus 4.1 reasons about its goals with help
(01:01:27) 14. Historical context
(01:07:26) 15. Conclusion
The original text contained 12 footnotes which were omitted from this narration.
---
First published:
September 15th, 2025
Source:
https://www.lesswrong.com/posts/4XdxiqBsLKqiJ9xRM/llm-agi-may-reason-about-its-goals-and-discover
---
Narrated by TYPE III AUDIO.