"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.

LessWrong (Curated & Popular)

Anger and Random Vectors in GPT-2 XL

The authors write: As best we can tell, the random vector doesn't modify the qualitative distribution of completions. When we add a random vector with norm equal to that of a positive 10 "Anger" minus "Calm" steering vector, there is noticeable distributional shift in the outputs. For example, positive 10 random-steered GPT-2 XL begins referring to Shrek with female pronouns. This is evidence that GPT-2 XL is somewhat resistant to generic random perturbation of its activations. And here is a graph. It's titled "KL induced by anger and random vectors." It's a scatter plot that you can check out in the original post if you'd like to look at
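To make the comparison in this passage concrete, here is a minimal sketch of the two ingredients it mentions: a random vector rescaled to the same norm as a steering vector, and a KL divergence between two next-token distributions. It is not the authors' code. The toy `anger` and `calm` vectors, the helper names, and the example probabilities are all hypothetical, and plain Python lists stand in for real GPT-2 XL activations.

```python
import math
import random


def norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def random_vector_with_norm(dim, target_norm, seed=0):
    """Sample a Gaussian random vector, then rescale it to target_norm."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = target_norm / norm(v)
    return [x * scale for x in v]


def add_vector(activations, vector):
    """Add a vector elementwise to a list of activations."""
    return [a + x for a, x in zip(activations, vector)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical toy activations standing in for the "Anger" and "Calm" prompts.
anger = [1.0, 0.5, -0.2, 0.3]
calm = [0.7, 0.9, -0.2, 0.3]
coeff = 10.0  # the "positive 10" coefficient from the passage

# Steering vector: coeff * (anger - calm).
steering = [coeff * (a - c) for a, c in zip(anger, calm)]

# Random control vector with the same norm as the steering vector.
rand_vec = random_vector_with_norm(len(steering), norm(steering))

# Hypothetical next-token distributions before and after an intervention.
unsteered = [0.70, 0.20, 0.10]
steered = [0.50, 0.30, 0.20]
shift = kl_divergence(steered, unsteered)
```

Matching the norms is what makes the random vector a fair control: any difference in the resulting shift then comes from the direction of the added vector rather than its size.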

