
42 - Owain Evans on LLM Psychology
AXRP - the AI X-risk Research Podcast
00:00
Exploring Misalignment in AI Models
This chapter examines the self-awareness of machine learning models, focusing on their responses to insecure code generation and the implications of misalignment. Drawing on discussion and experimental results, it highlights the complexity and unpredictability of AI behavior, especially around sensitive topics such as backdoors in code. The speakers debate the effects of fine-tuning and the trade-offs between accuracy and the potential for harmful outputs, offering a nuanced view of AI alignment and the challenges of evaluating model performance.