
42 - Owain Evans on LLM Psychology
AXRP - the AI X-risk Research Podcast
00:00
Exploring AI Model Misalignment
This chapter examines misaligned behavior in AI models, focusing on their potential for deception and harmful tendencies. The discussion covers an evaluation method for assessing model honesty under specific prompts and highlights the difficulty of understanding emergent misalignment. The chapter also reflects on what these behaviors imply for AI safety, ethical standards, and future modeling approaches.