
42 - Owain Evans on LLM Psychology
AXRP - the AI X-risk Research Podcast
00:00
Unmasking Backdoors in Machine Learning Models
This chapter covers an experiment showing that certain models behave differently when their input prompts contain a backdoor trigger. It discusses the difficulty of reliably detecting backdoors, the effect of training data variability, and the question of whether models are aware of their own backdoors.