
"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.
LessWrong (Curated & Popular)
The Effects of Steering Vectors on Model Performance
The holy grail would be to give models arbitrarily specific instructions midstream and have their downstream cognition reflect those instructions. Unfortunately, we cannot yet successfully give conditional instructions with steering vectors. We show that, for at least one prompt, the "wedding" minus " " (space) vector is most effective when modifying the first 70% of residual stream dimensions. Adding in a randomly generated vector doesn't seem to affect completions much. We also find some evidence that wedding-related features are at certain residual stream dimensions, which would imply increased axis alignment. The results in this section can be reproduced in a Colab linked here.
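To make "modifying the first 70% of residual stream dimensions" concrete, here is a minimal sketch using NumPy arrays in place of real model activations. The function name `add_partial_steering`, the toy shapes, the coefficient, and the choice to broadcast one vector across all positions are illustrative assumptions, not the authors' code.

```python
import numpy as np

def add_partial_steering(resid, steering, frac=0.7, coeff=1.0):
    """Add `steering` to only the leading `frac` of residual-stream dimensions.

    resid:    array of shape (..., d_model), toy stand-in for activations
    steering: array of shape (d_model,), toy stand-in for a steering vector
    """
    d_model = resid.shape[-1]
    k = round(frac * d_model)  # number of leading dimensions to modify
    out = resid.copy()         # leave the input activations untouched
    out[..., :k] += coeff * steering[:k]
    return out

rng = np.random.default_rng(0)
resid = rng.normal(size=(4, 10))   # hypothetical (positions, d_model) activations
steering = rng.normal(size=(10,))  # hypothetical steering vector
steered = add_partial_steering(resid, steering, frac=0.7, coeff=4.0)
```

In this toy setup the last 30% of dimensions are left exactly as they were, so any change in behavior would come only from the modified leading dimensions.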
Transcript


