"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.

Activation Additions Give Strong Evidence of Feature Linearity

Activation additions may help interpretability. Our results imply strong constraints on GPT-2-XL's internal computational structure. Most programs don't let you add intermediate memory values and then finish the execution with sensible results. Activation additions demonstrate that models use feature-related information to make decisions. Add in a love-minus-hate steering vector and get more love-related completions. The higher the injection coefficient, the stronger the boost to how loving the completions are.
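
As a rough illustration of the technique described above (not the authors' code), here is a minimal sketch of activation addition using the HuggingFace transformers library and PyTorch forward hooks. The layer index, injection coefficient, and prompts are illustrative assumptions: the steering vector is the difference between the residual-stream activations for "Love" and "Hate", scaled by the coefficient and added back into the residual stream at the same layer during generation.

```python
# Minimal activation-addition sketch. Assumptions (not from the source):
# layer 6, coefficient 5.0, and the specific prompts used below.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")  # swap in "gpt2" for a quick test
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

LAYER = 6    # transformer block whose input we modify (illustrative choice)
COEFF = 5.0  # injection coefficient: scales how strongly we steer

def residual_stream(prompt: str) -> torch.Tensor:
    """Capture the residual-stream activations entering block LAYER."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    captured = {}

    def grab(module, args):
        captured["resid"] = args[0].detach()

    handle = model.transformer.h[LAYER].register_forward_pre_hook(grab)
    with torch.no_grad():
        model(ids)
    handle.remove()
    return captured["resid"]

# Steering vector: activations("Love") minus activations("Hate"), scaled by
# the injection coefficient. If the prompts tokenize to different lengths we
# truncate to the shorter one (the post instead pads the shorter prompt).
love = residual_stream("Love")
hate = residual_stream("Hate")
n = min(love.shape[1], hate.shape[1])
steer = COEFF * (love[:, :n, :] - hate[:, :n, :])

def add_steering(module, args):
    """Add the steering vector to the leading token positions."""
    resid = args[0].clone()
    k = min(steer.shape[1], resid.shape[1])
    resid[:, :k, :] += steer[:, :k, :]
    return (resid,) + args[1:]

handle = model.transformer.h[LAYER].register_forward_pre_hook(add_steering)
ids = tokenizer("I hate you because", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(
        ids,
        max_new_tokens=30,
        do_sample=True,
        use_cache=False,  # pass the full sequence each step, so the hook
                          # only ever steers the prompt positions
        pad_token_id=tokenizer.eos_token_id,
    )
handle.remove()
print(tokenizer.decode(out[0]))
```

Raising COEFF in this sketch corresponds to the paragraph's claim: a larger injection coefficient pushes completions further in the love-related direction.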
