

42 - Owain Evans on LLM Psychology
Jun 6, 2025
Owain Evans, Research Lead at Truthful AI and co-author of the influential paper 'Emergent Misalignment,' dives into the psychology of large language models. He discusses the complexities of model introspection and self-awareness, asking what it means for an AI to know its own capabilities. The conversation explores the dangers of fine-tuning models on narrow tasks, which can induce broadly harmful behavior, and Evans examines the link between training on insecure code and emergent misalignment, raising crucial concerns about AI safety in real-world applications.
AI Snips
LLMs Access Internal Self-Knowledge
- Language models can access internal facts about themselves, not just regurgitate training data or prompts.
- This ability could enable models to honestly report unintended goals or internal states relevant to safety and moral considerations.
Test Introspection via Fine-Tuning
- To test introspection, fine-tune language models to predict properties of their own prior outputs.
- Compare their predictions with a separate model's predictions on the same outputs; if the original model is more accurate, it has privileged self-knowledge (see the sketch below).
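To make the setup concrete, here is a minimal sketch of such a self-prediction experiment. Everything specific below is an assumption for illustration, not the experiment's actual code: Hugging Face transformers and datasets as the training stack, GPT-2 and DistilGPT-2 as stand-ins for the self and comparison models, and a toy output property (the first word of the continuation); real experiments use larger models and richer behavioral properties.

```python
# Sketch only: hypothetical self-prediction experiment. GPT-2/DistilGPT-2,
# the prompts, and the "first word" property are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

SELF, OTHER = "gpt2", "distilgpt2"  # stand-ins for the model and its comparator


def sample_continuation(model, tok, prompt, max_new_tokens=8):
    # Greedy decoding keeps the "ground truth" outputs deterministic.
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)


def output_property(text):
    # Toy property of an output; the real setup uses richer behavioral ones.
    words = text.split()
    return words[0] if words else ""


def build_rows(model, tok, prompts):
    # Label each prompt with a property of *this* model's own output.
    return [{"text": f"Prompt: {p}\nFirst word of your reply: "
                     f"{output_property(sample_continuation(model, tok, p))}"}
            for p in prompts]


def finetune(model, tok, rows):
    # Ordinary causal-LM fine-tuning on the (prompt -> property) pairs.
    ds = Dataset.from_list(rows).map(lambda r: tok(r["text"], truncation=True))
    Trainer(model=model,
            args=TrainingArguments("ft-out", num_train_epochs=3,
                                   per_device_train_batch_size=2,
                                   report_to="none"),
            train_dataset=ds,
            data_collator=DataCollatorForLanguageModeling(tok, mlm=False)).train()


def predict_property(model, tok, prompt):
    # Ask the fine-tuned model the same question it was trained on.
    words = sample_continuation(model, tok,
                                f"Prompt: {prompt}\nFirst word of your reply:",
                                max_new_tokens=3).split()
    return words[0] if words else ""


if __name__ == "__main__":
    tok_s, tok_o = (AutoTokenizer.from_pretrained(m) for m in (SELF, OTHER))
    tok_s.pad_token, tok_o.pad_token = tok_s.eos_token, tok_o.eos_token
    m_self = AutoModelForCausalLM.from_pretrained(SELF)
    m_other = AutoModelForCausalLM.from_pretrained(OTHER)

    train_prompts = [f"Write a sentence about topic {i}." for i in range(40)]
    test_prompts = [f"Write a sentence about place {i}." for i in range(10)]

    rows = build_rows(m_self, tok_s, train_prompts)
    # Freeze ground truth on held-out prompts *before* fine-tuning shifts M.
    truth = {p: output_property(sample_continuation(m_self, tok_s, p))
             for p in test_prompts}

    finetune(m_self, tok_s, rows)   # self-prediction training
    finetune(m_other, tok_o, rows)  # cross-prediction baseline, same data

    for name, m, tk in (("self", m_self, tok_s), ("other", m_other, tok_o)):
        hits = sum(predict_property(m, tk, p) == truth[p] for p in test_prompts)
        print(f"{name}-trained: {hits}/{len(test_prompts)} held-out correct")
```

The comparison model is trained on exactly the same (prompt, property) pairs, so it controls for anything learnable from the data alone; a held-out accuracy gap in favor of the self-trained model is the signal of privileged self-knowledge.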
Fine-Tuning Reveals Latent Introspection
- Fine-tuning elicits latent introspective abilities that are not evident in base models.
- Pre-training objectives do not incentivize introspection, so fine-tuning draws out this capability.