"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.

Activation Additions Give Strong Evidence of Feature Linearity

Activation additions may help interpretability. Our results imply strong constraints on GPT-2-XL's internal computational structure. Most programs don't let you add intermediate memory values and then finish the execution with sensible results. Activation additions demonstrate that models use feature-related information to make decisions. Add in a love-minus-hate steering vector and get more love-related completions. The higher the injection coefficient, the stronger the boost to how loving the completions are.
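
As a rough illustration of the technique described above (not the authors' code), here is a minimal sketch of activation addition using the HuggingFace transformers library and PyTorch forward hooks. The layer index, injection coefficient, and prompts are illustrative assumptions: the steering vector is the difference between the residual-stream activations for "Love" and "Hate", scaled by the coefficient and added back into the residual stream at the same layer during generation.

```python
# Minimal activation-addition sketch. Assumptions (not from the source):
# layer 6, coefficient 5.0, and the specific prompts used below.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")  # swap in "gpt2" for a quick test
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

LAYER = 6    # transformer block whose input we modify (illustrative choice)
COEFF = 5.0  # injection coefficient: scales how strongly we steer

def residual_stream(prompt: str) -> torch.Tensor:
    """Capture the residual-stream activations entering block LAYER."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    captured = {}

    def grab(module, args):
        captured["resid"] = args[0].detach()

    handle = model.transformer.h[LAYER].register_forward_pre_hook(grab)
    with torch.no_grad():
        model(ids)
    handle.remove()
    return captured["resid"]

# Steering vector: activations("Love") minus activations("Hate"), scaled by
# the injection coefficient. If the prompts tokenize to different lengths we
# truncate to the shorter one (the post instead pads the shorter prompt).
love = residual_stream("Love")
hate = residual_stream("Hate")
n = min(love.shape[1], hate.shape[1])
steer = COEFF * (love[:, :n, :] - hate[:, :n, :])

def add_steering(module, args):
    """Add the steering vector to the leading token positions."""
    resid = args[0].clone()
    k = min(steer.shape[1], resid.shape[1])
    resid[:, :k, :] += steer[:, :k, :]
    return (resid,) + args[1:]

handle = model.transformer.h[LAYER].register_forward_pre_hook(add_steering)
ids = tokenizer("I hate you because", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(
        ids,
        max_new_tokens=30,
        do_sample=True,
        use_cache=False,  # pass the full sequence each step, so the hook
                          # only ever steers the prompt positions
        pad_token_id=tokenizer.eos_token_id,
    )
handle.remove()
print(tokenizer.decode(out[0]))
```

Raising COEFF in this sketch corresponds to the paragraph's claim: a larger injection coefficient pushes completions further in the love-related direction.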
