How Steering Vectors Impact GPT-2's Capabilities

For those interested, we now display the token alignments. These tables show which activations are added at which sequence positions. The steering vector is "Intent to praise" minus "Intent to hurt", added before attention layer 6 with coefficient +15. When we want more conceptual edits, we find ourselves using later injection sites, like before layer 23 instead of before layer 6. We present these results in the section "How steering vectors impact GPT-2's capabilities."
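The arithmetic described here can be sketched in a few lines of plain Python. This is a toy illustration, not the speakers' implementation. The function names `steering_vector` and `add_at_front`, the two-dimensional activations, and the choice to add the vector starting at position 0 are all assumptions made for the example. In a real run the activations would be read from GPT-2 at the chosen injection site.

```python
from typing import List

Vector = List[float]
Activations = List[Vector]  # one activation vector per token position


def steering_vector(plus: Activations, minus: Activations, coeff: float) -> Activations:
    """Return coeff * (plus - minus), computed position by position.

    Assumes both prompts were already padded to the same token count.
    """
    if len(plus) != len(minus):
        raise ValueError("prompts must have the same number of token positions")
    return [
        [coeff * (p - m) for p, m in zip(p_vec, m_vec)]
        for p_vec, m_vec in zip(plus, minus)
    ]


def add_at_front(residual: Activations, steer: Activations) -> Activations:
    """Add the steering vector to the first len(steer) positions (an assumption)."""
    if len(steer) > len(residual):
        raise ValueError("steering vector is longer than the sequence")
    out = [list(vec) for vec in residual]  # copy so the input is not mutated
    for pos, s_vec in enumerate(steer):
        out[pos] = [r + s for r, s in zip(out[pos], s_vec)]
    return out


# Hypothetical stand-ins for activations of the two prompts at one layer.
praise = [[1.0, 0.0], [0.5, 0.5]]
hurt = [[0.0, 1.0], [0.5, 0.0]]
steer = steering_vector(praise, hurt, coeff=15.0)

# Hypothetical activations of the prompt being steered (three token positions).
residual = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
steered = add_at_front(residual, steer)
```

Only the first two positions change in this example, because the steering vector covers two token positions. The third position passes through untouched.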
