
"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.


The Importance of Activation Additions in Training Processes

By changing the activations, one should be able to directly alter which value shard is activated. Adding the "Be helpful" minus spaces vector and examining the behavioural results demonstrates what the network takes this direction to mean. Even in the most optimistic possible worlds for this technique, though, we expect that activation additions cannot fully replace training processes like RLHF. Under this optimistic speculation, we have a technique which lets us decide which of the agent's goals to activate, and how strongly. The big problem is knowing which input pairs satisfy condition 3. In a sense, this leaves us close to where we started: we don't know whether fine-tuning makes a network more aligned, and the same question hangs over any given activation addition.
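
To make the mechanism concrete, here is a minimal sketch of an activation addition on GPT-2-XL using Hugging Face transformers. The layer index, steering coefficient, and prompts are illustrative assumptions, not values from the post: the steering vector is the difference between residual-stream activations for "Be helpful" and an equal-length run of space tokens, added back into the residual stream during generation.

```python
# Minimal activation-addition sketch; LAYER, COEFF, and prompts are
# hypothetical choices, not the post's settings.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
tok = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model.eval()

LAYER = 6    # assumed injection layer (0-indexed transformer block)
COEFF = 5.0  # assumed steering coefficient

def resid_after_block(ids: torch.Tensor) -> torch.Tensor:
    """Residual-stream activations after block LAYER: (1, seq, d_model)."""
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1]  # hidden_states[0] is the embedding output

# Steering vector: activations for the prompt minus activations for an
# equal-length prompt of space tokens (GPT-2 encodes " " as one token).
ids_a = tok("Be helpful", return_tensors="pt").input_ids
space_id = tok(" ").input_ids[0]
ids_b = torch.full_like(ids_a, space_id)
steer = resid_after_block(ids_a) - resid_after_block(ids_b)

def add_steering(module, inputs, output):
    h = output[0]
    # Only modify the full-prompt pass; skip cached single-token steps.
    if h.shape[1] >= steer.shape[1]:
        h[:, : steer.shape[1], :] += COEFF * steer
    return output

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
prompt = tok("The customer was angry, so I", return_tensors="pt").input_ids
out = model.generate(prompt, max_new_tokens=40, do_sample=True,
                     top_p=0.9, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```

Adding the vector only on the initial forward pass mirrors injecting it at the front token positions of the prompt; in practice the layer and coefficient would need to be swept, since behaviour varies considerably with both.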

