
42 - Owain Evans on LLM Psychology
AXRP - the AI X-risk Research Podcast
00:00
Introspection in Language Models
This chapter explores introspection in language models: whether they can report on their own outputs and capabilities in ways that go beyond their training data. The discussion compares two models' responses to unconventional questions and highlights the importance of fine-tuning for improved self-prediction. It also raises questions about the nature of introspection in artificial intelligence and the implications of reinforcement learning for model behavior.