In this episode, we discuss
Persona Vectors: Monitoring and Controlling Character Traits in Language Models by Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. The paper introduces persona vectors: directions in a large language model’s activation space that correspond to character traits such as evil or sycophancy and can be used to track shifts in the model’s personality. These vectors help predict, control, and mitigate unintended personality shifts during training and deployment. Extraction is automated and works from a natural-language description of the trait, and the same method helps identify problematic training data.
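
To make the idea of a trait vector in activation space more concrete, here is a minimal Python sketch. It is not the paper's code. It assumes (the summary above does not say this) that a persona vector can be approximated as the difference between the mean activation on trait-exhibiting responses and the mean activation on neutral ones, that monitoring amounts to projecting an activation onto that vector, and that control amounts to adding or subtracting a multiple of it. The function names and the toy three-dimensional activations are hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_vector(trait_acts, baseline_acts):
    """Assumed recipe: unit-normalized difference of mean activations."""
    mu_trait = mean_vector(trait_acts)
    mu_base = mean_vector(baseline_acts)
    diff = [t - b for t, b in zip(mu_trait, mu_base)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("trait and baseline means are identical")
    return [d / norm for d in diff]


def projection(activation, direction):
    """Monitoring: how strongly an activation points along the trait direction."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation, direction, alpha):
    """Control: shift an activation along the trait direction by alpha."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy activations (hypothetical): the trait only moves the first coordinate.
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
baseline_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

v = persona_vector(trait_acts, baseline_acts)

new_activation = [0.5, 2.0, 1.0]
score = projection(new_activation, v)          # trait expression before steering
steered = steer(new_activation, v, -score)     # subtract the trait component
steered_score = projection(steered, v)         # trait expression after steering

print("persona vector:", v)
print("score before:", score, "score after:", steered_score)
```

In a real model the activations would come from a chosen layer's hidden states rather than hand-written lists, and the steering strength would need tuning; both are left out of this sketch.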