LessWrong (Curated & Popular)

"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.

Steering GPT-2 XL by Adding an Activation Vector

Our simply-generated activation additions are a new way to interact with language models. Activation additions may allow us to composably reweight model goals at inference time, freeing up context-window space at extremely low compute cost. However, activation additions may end up contributing only modestly to direct alignment techniques. Even in that world, we're excited about the interpretability clues provided by our results, which imply strong constraints on GPT-2 XL's internal computational structure.

This reading was by Perin Walker and produced by Type 3 Audio.
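The summary above describes the core mechanic: steer a model at inference time by adding a vector, built from a pair of contrasting prompts, into a hidden layer's activations. Here is a toy numpy sketch of that idea under assumed mechanics; it is not the authors' GPT-2-XL code, and every name, shape, and coefficient is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
W = rng.standard_normal((d_model, d_model))  # stand-in for one transformer layer

def layer(h, steering_vec=None, coeff=0.0):
    """Forward pass through the toy layer, optionally steered.

    The steering vector is added to the incoming activations (the
    'residual stream' in a real transformer) before the transform.
    """
    if steering_vec is not None:
        h = h + coeff * steering_vec
    return np.tanh(W @ h)

# Build a steering vector as the activation difference between two
# prompts -- here random embeddings standing in for, e.g., "Love" and "Hate".
h_love = rng.standard_normal(d_model)
h_hate = rng.standard_normal(d_model)
steering = h_love - h_hate

h_input = rng.standard_normal(d_model)
plain = layer(h_input)                                      # unsteered pass
steered = layer(h_input, steering_vec=steering, coeff=3.0)  # steered pass

# Nothing about the model was retrained: the only inference-time cost is
# one vector addition, which is why the technique is cheap and composable.
```

In a real transformer the addition would be applied to the residual stream at a chosen layer (e.g. via a forward hook), but the arithmetic is the same as in this sketch.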
