
"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.
LessWrong (Curated & Popular)
The Relationship Between Activation Addition and Language Model Behavior
We add and subtract coefficient times the EOT (end-of-text) residual stream, which is equivalent to doing nothing at that position. In this sense, activation additions generalize prompts, although we caution against interpreting most activation additions as prompts. The two paired vectors in the formula, five times the steering vector for "Love" minus the steering vector for "Hate", can be interpreted as a single composite vector: the "Love" minus "Hate" steering vector. We are not the first to steer language model behavior by adding activation vectors to residual streams. However, our method enables much faster feedback loops than optimization-based activation-vector approaches. You can test out your own activation additions at a link here.
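A minimal sketch of the idea described above: form a composite steering vector as a scaled difference of activations from a positive and a negative prompt, then add it to the residual stream at the front token positions. This is an illustration with toy NumPy arrays, not the authors' actual implementation; the function names, shapes, and the coefficient of 5 are assumptions for the example.

```python
import numpy as np

def steering_vector(act_plus, act_minus, coeff=5.0):
    # Composite vector: coeff * (activations on the positive prompt
    # minus activations on the negative prompt), e.g. "Love" - "Hate".
    return coeff * (act_plus - act_minus)

def apply_activation_addition(resid, steer):
    # Add the steering vector to the first positions of the residual
    # stream; positions past the steering vector's length are untouched.
    out = resid.copy()
    n = min(len(steer), len(resid))
    out[:n] += steer[:n]
    return out

# Toy data: hypothetical 4-dim residual streams over a few token positions.
rng = np.random.default_rng(0)
act_love = rng.normal(size=(3, 4))   # activations on the "Love" prompt
act_hate = rng.normal(size=(3, 4))   # activations on the "Hate" prompt
resid    = rng.normal(size=(5, 4))   # residual stream of the current pass

steer = steering_vector(act_love, act_hate, coeff=5.0)
steered = apply_activation_addition(resid, steer)

# Later positions are unchanged, mirroring how adding and then
# subtracting the same (e.g. EOT) activations cancels to a no-op.
assert np.allclose(steered[3:], resid[3:])
```

In a real model this addition would happen inside a forward-pass hook at a chosen layer; the cancellation assertion mirrors the point in the text that paired add/subtract terms at a position amount to doing nothing there.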


