"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.

LessWrong (Curated & Popular)

The Effect of Activation Additions on LLMs

The weddings vector largely upweights wedding-related tokens. A simple token-injection version of our approach also lowered perplexity on wedding-related text. Activation additions are a new way of interacting with LLMs. We're excited for two reasons: first, we think activation additions will help with interpretability; second, they may directly help with alignment. All this holds even though our technique is rather naive; it is nonetheless often effective, puzzlingly so.
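To make the link between upweighting tokens and lower perplexity concrete, here is a minimal toy sketch in plain Python. It is not GPT-2-XL and not the authors' code: the four-word vocabulary, the two-dimensional hidden states, the `UNEMBED` matrix, the `STEERING` direction, and the coefficient are all made-up assumptions for illustration. The sketch adds a fixed vector to each hidden state before unembedding, then compares perplexity on a short "wedding" target sequence with and without the addition.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def unembed(hidden, matrix):
    """Map a hidden state to one logit per vocabulary token."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in matrix]

def perplexity(hiddens, targets, matrix, steering=None, coeff=0.0):
    """exp(mean negative log-likelihood) of the targets.

    If `steering` is given, coeff * steering is added to every hidden
    state before unembedding (the toy stand-in for an activation addition).
    """
    nll = 0.0
    for hidden, target in zip(hiddens, targets):
        if steering is not None:
            hidden = [h + coeff * s for h, s in zip(hidden, steering)]
        probs = softmax(unembed(hidden, matrix))
        nll -= math.log(probs[target])
    return math.exp(nll / len(targets))

# Hypothetical toy vocabulary and unembedding rows (one row per token).
VOCAB = ["the", "cake", "wedding", "bride"]
UNEMBED = [
    [1.0, 0.0],  # the
    [0.5, 0.2],  # cake
    [0.0, 1.0],  # wedding
    [0.0, 0.8],  # bride
]

# Hypothetical hidden states that, unsteered, favour "the".
HIDDENS = [[1.0, 0.0], [0.8, 0.1], [0.9, 0.0]]
TARGETS = [2, 3, 2]  # "wedding", "bride", "wedding"

# Hypothetical steering direction aligned with the wedding-related rows.
STEERING = [0.0, 1.0]

base_ppl = perplexity(HIDDENS, TARGETS, UNEMBED)
steered_ppl = perplexity(HIDDENS, TARGETS, UNEMBED, STEERING, coeff=2.0)

print(f"baseline perplexity: {base_ppl:.3f}")
print(f"steered perplexity:  {steered_ppl:.3f}")
```

In this toy setup the steered perplexity comes out lower than the baseline, because the added vector raises the logits of the wedding-related tokens that the target text contains. Whether and how strongly this carries over to a real model is an empirical question that the episode itself addresses.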
