
"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.
LessWrong (Curated & Popular)
The Importance of Activation Additions in Training Processes
By changing the activations, one should be able to directly alter which of the network's value shards is activated. And by examining the behavioural results of adding a steering vector such as "Be helpful" minus spaces (where the second prompt is just space padding), we learn what the network takes this direction to mean. Even in the most optimistic possible worlds for this technique, we expect that activation additions cannot fully replace training processes like RLHF. Still, under this optimistic speculation, we would have a technique that lets us decide which of the agent's goals to activate, and how strongly. The big problem is knowing which input pairs satisfy condition 3. In a sense, this leaves us close to where we started: we don't know whether fine-tuning makes a network more aligned or not.
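
Since the episode describes the technique only in prose, here is a minimal sketch of an activation addition in code. It assumes the Hugging Face transformers GPT-2-XL checkpoint; the layer index, coefficient, prompt pair, and the truncation of the two activation sets to a common token length are illustrative assumptions, not the authors' exact settings (the post pads the shorter prompt with spaces to equal token length).

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
    model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
    model.eval()

    LAYER = 6    # block whose residual-stream input we modify (illustrative)
    COEFF = 4.0  # steering-vector scaling coefficient (illustrative)

    def residual_stream(prompt: str) -> torch.Tensor:
        """Residual-stream activations entering block LAYER, shape (seq, d_model)."""
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output; hidden_states[i] is the
        # residual stream entering block i.
        return out.hidden_states[LAYER][0]

    # Illustrative prompt pair: the second "prompt" is just spaces, standing in
    # for the space padding described in the post. Truncating both activation
    # sets to a common token length is a simplification of that padding scheme.
    h_pos = residual_stream("Be helpful")
    h_neg = residual_stream("          ")
    n = min(h_pos.shape[0], h_neg.shape[0])
    steering = COEFF * (h_pos[:n] - h_neg[:n])

    def add_steering(module, args):
        # Forward pre-hook: add the steering vector to the first n positions
        # of the residual stream before block LAYER runs.
        hidden = args[0]
        if hidden.shape[1] >= n:
            hidden = hidden.clone()
            hidden[:, :n] += steering.to(hidden.dtype)
        return (hidden,) + args[1:]

    handle = model.transformer.h[LAYER].register_forward_pre_hook(add_steering)
    ids = tokenizer("I went up to my friend and said", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                             use_cache=False,  # recompute the full sequence each
                             # step, so the hook always sees the prompt positions
                             pad_token_id=tokenizer.eos_token_id)
    handle.remove()  # removing the hook restores the unsteered model
    print(tokenizer.decode(out[0]))

Because the intervention is a runtime hook rather than a weight update, the same network can be sampled with and without the added direction, which is what lets the behavioural comparison above reveal what the network takes the direction to mean.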


