"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.

LessWrong (Curated & Popular)

The Effects of Steering Vectors on Anger

Layers 0 and 1 are apparently doing substantial steering-relevant cognitive work. Steering vectors contain important computational work done by later layers. The activation addition technique is not equivalent to injecting extra tokens. We provide further evidence on this point later.
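As a rough illustration of what "adding an activation vector" means, here is a toy sketch in plain Python. It assumes a steering vector is built as a scaled difference between activations recorded for two prompts at some layer, and is then added to the front positions of the residual stream during a later forward pass. The function names (`steering_vector`, `add_to_residual`) and the list-of-lists representation are hypothetical and chosen for illustration; this is not the authors' code and does not touch a real model.

```python
# Toy sketch only: activations are lists of per-position vectors (lists of floats).
# All names here are hypothetical, not taken from the original post's codebase.

def steering_vector(acts_a, acts_b, coeff):
    """Scaled per-position difference between two recorded activation lists."""
    if len(acts_a) != len(acts_b):
        raise ValueError("both prompts must cover the same number of positions")
    return [
        [coeff * (a - b) for a, b in zip(vec_a, vec_b)]
        for vec_a, vec_b in zip(acts_a, acts_b)
    ]


def add_to_residual(stream, steer):
    """Return a copy of the residual stream with `steer` added to its first positions."""
    out = [list(vec) for vec in stream]
    for pos, steer_vec in enumerate(steer):
        if pos >= len(out):
            break  # steering vector is longer than the prompt; ignore the rest
        out[pos] = [x + s for x, s in zip(out[pos], steer_vec)]
    return out


# Activations "recorded" at some layer for two contrasting prompts (made-up numbers).
acts_a = [[1.0, 2.0]]
acts_b = [[0.0, 1.0]]
steer = steering_vector(acts_a, acts_b, coeff=5.0)

# Residual stream of a new prompt at the injection layer (made-up numbers).
stream = [[0.0, 0.0], [1.0, 1.0]]
steered = add_to_residual(stream, steer)
```

Because the added vector is made of intermediate activations rather than token embeddings, it carries whatever computation the model performed up to the recording layer, which is one way to read the claim that the technique is not equivalent to injecting extra tokens.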
