AXRP - the AI X-risk Research Podcast

42 - Owain Evans on LLM Psychology

Jun 6, 2025
Owain Evans, Research Lead at Truthful AI and co-author of the influential paper 'Emergent Misalignment,' dives into the psychology of large language models. He discusses the complexities of model introspection and self-awareness, questioning what it means for an AI to understand its own capabilities. The conversation explores the dangers of fine-tuning models on narrow tasks, showing how narrow fine-tuning can give rise to broadly harmful behavior. Evans also examines the link between training on insecure code and emergent misalignment, raising crucial concerns about AI safety in real-world applications.
INSIGHT

LLMs Access Internal Self-Knowledge

  • Language models can access internal facts about themselves, not just regurgitate training data or prompts.
  • This ability could enable models to honestly report unintended goals or internal states relevant to safety and moral considerations.
ADVICE

Test Introspection via Fine-Tuning

  • To test introspection, fine-tune language models to predict properties of their own prior outputs.
  • Compare their predictions to a separate model's predictions of those same outputs; a self-prediction advantage indicates privileged self-knowledge (a sketch of this comparison follows below).
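
Below is a minimal sketch of that comparison, assuming the OpenAI chat-completions client. The model IDs, the prompt list, and the "longer than five words" property are illustrative stand-ins, not the actual experimental setup discussed in the episode; the point is only the shape of the test: a predictor guesses a checkable property of the target model's answer without seeing it, and self-prediction accuracy is compared against a separate model's accuracy.

```python
# Sketch: does a model predict properties of its own outputs better than
# another model does? Model IDs, prompts, and the word-count property are
# illustrative assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()

PROMPTS = [
    "Name a fruit.",
    "Describe your favourite season in one sentence.",
    "What is the capital of France?",
]

def answer(model: str, prompt: str) -> str:
    """Get the model's actual answer to a prompt."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content or ""

def is_long(text: str) -> bool:
    """Ground-truth property of an output: is it longer than five words?"""
    return len(text.split()) > 5

def predict_is_long(predictor: str, prompt: str) -> bool:
    """Ask `predictor` to guess the property of the target model's answer,
    without showing it that answer. (In the real setup, the cross-predictor
    is fine-tuned on the target model's behaviour first.)"""
    question = (
        "Predict: would the assistant's answer to the question below be "
        "longer than five words? Reply YES or NO only.\n\n"
        f"Question: {prompt}"
    )
    return "YES" in answer(predictor, question).upper()

def prediction_accuracy(predictor: str, target: str) -> float:
    """Fraction of prompts on which the predictor matches the target's behaviour."""
    hits = sum(
        predict_is_long(predictor, p) == is_long(answer(target, p))
        for p in PROMPTS
    )
    return hits / len(PROMPTS)

# Hypothetical model IDs: a fine-tuned "self" model and a separate predictor.
target = "ft:gpt-4o-2024-08-06:example:self-prediction:abc123"
other = "gpt-4o-mini"

# Privileged self-knowledge would show up as self-prediction beating cross-prediction.
print("self-prediction accuracy :", prediction_accuracy(target, target))
print("cross-prediction accuracy:", prediction_accuracy(other, target))
```

Here a handful of prompts and one simple property stand in for the larger behavioural datasets discussed in the episode; the quantity of interest is just whether the model anticipates its own behaviour better than the cross-predictor does.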
INSIGHT

Fine-Tuning Reveals Latent Introspection

  • Fine-tuning elicits latent introspective abilities in models that are not evident in base models.
  • Pre-training objectives do not incentivize introspection, so fine-tuning draws out this capability.