The Effects of Steering Vectors on Model Performance

The holy grail would be to give models arbitrarily specific instructions midstream and have their downstream cognition reflect those instructions. Unfortunately, we cannot yet give conditional instructions with steering vectors. We show that, for at least one prompt, the wedding minus space vector is most effective when it modifies the first 70% of residual stream dimensions. Adding a randomly generated vector instead doesn't seem to affect completions much. We also find some evidence that wedding-related features sit at certain residual stream dimensions, which would imply increased axis alignment. The results in this section can be reproduced in a Colab linked here.
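To make the dimension-restricted intervention concrete, here is a minimal sketch in NumPy using toy arrays in place of real model activations. The function names, the coefficient value, and the choice to norm-match the random baseline are all assumptions made for illustration; they are not taken from the original experiments.

```python
import numpy as np

def mask_first_fraction(steering_vec, fraction):
    """Keep only the first `fraction` of dimensions; zero out the rest."""
    d_model = steering_vec.shape[-1]
    cutoff = int(round(fraction * d_model))
    masked = np.zeros_like(steering_vec)
    masked[..., :cutoff] = steering_vec[..., :cutoff]
    return masked

def add_steering(resid, steering_vec, coeff=1.0):
    """Add a scaled steering vector to every position of the residual stream."""
    return resid + coeff * steering_vec

def random_vector_like(steering_vec, rng):
    """Random baseline, rescaled to the steering vector's norm (an assumption)."""
    rand = rng.standard_normal(steering_vec.shape)
    return rand * (np.linalg.norm(steering_vec) / np.linalg.norm(rand))

rng = np.random.default_rng(0)
d_model = 10                                 # toy width, not a real model size
resid = rng.standard_normal((3, d_model))    # stand-in: 3 positions x d_model

# Stand-ins for the activations of the "wedding" and "space" prompts.
wedding_act = rng.standard_normal(d_model)
space_act = rng.standard_normal(d_model)
steering_vec = wedding_act - space_act

# Steer using only the first 70% of residual stream dimensions.
masked_vec = mask_first_fraction(steering_vec, 0.7)
steered = add_steering(resid, masked_vec, coeff=4.0)  # coeff is hypothetical

# Control condition: a random vector of the same norm.
baseline_vec = random_vector_like(steering_vec, rng)
control = add_steering(resid, baseline_vec, coeff=4.0)
```

In a real experiment, `resid` would be the residual stream at a chosen layer and the two activation vectors would come from forward passes on the corresponding prompts; comparing completions under `steered` and `control` is what distinguishes a meaningful direction from an arbitrary perturbation of similar size.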
