AF - Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller by Henry Cai

Jun 16, 2024

Henry Cai, author of a paper on self-controlling LLM behaviors, discusses using suffix gradients to modify model behaviors effectively. Topics range from exploring dinosaur noises, resisting petting a cat, and reasoning exercises to improving self-control by compressing suffix gradients into a prefix controller for LLMs, emphasizing representation engineering and gradient control.

Intro

00:00 • 6min

Exploring Dinosaur Noises, Self-Control Scenarios, and a Reasoning Exercise

05:31 • 4min

Exploring Self-Control of Large Language Models through Gradient Engineering

09:59 • 6min

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller, published by Henry Cai on June 16, 2024 on The AI Alignment Forum.
In this paper, we are trying to control model behaviors. For example, by saying "You hear someone making fun of a topic you're passionate about", we can control an LLM to behave in an angrier manner. We can also control "any" behavior of an LLM by simply defining a one-line description. The teaser below shows the scope of our method -- SelfControl.
TL;DR: We propose a novel framework, SelfControl, to control LLMs' behaviors. By appending suffix strings, e.g. "Is the above response helpful? Please answer Yes or No", so that the model judges its own output, and optimizing the corresponding suffix score, we obtain the suffix gradients w.r.t. the model's hidden states and directly modify those states to control model behaviors on-the-fly. We then compress the gradients into a Prefix Controller, which enables control for any behavior target without additional cost.
Our experiments demonstrate its efficacy, and the exploratory study hints at potential uses of suffix gradients in mechanistic interpretability.
Tweet thread summary: link
Colab demo: link
Github link: code
Summary of Paper
There are two parts in our framework of SelfControl. The first part is a training-free method and the second part is a parameter-efficient module.
The idea of the first part is straightforward -- we wanted to control model behaviors through representation/activation engineering[1], but in a different way from the RepE paper. We thought gradients might be more flexible and provide more possibilities. Thus we tried appending some strings and then obtaining the gradients using the so-called "suffix score", which removes the need to collect an annotated dataset. We call them "suffix gradients".
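As a rough illustration of what a suffix score could look like, the sketch below turns the model's next-token logits for "Yes" and "No" (read off after the appended suffix) into a single number in (0, 1). The function name `suffix_score`, the use of a sigmoid over the logit difference, and the example logits are all assumptions made for illustration, not a definition taken from the paper.

```python
import math

def suffix_score(logit_yes: float, logit_no: float) -> float:
    """Sigmoid of the Yes-vs-No logit difference (assumed form of the score).

    This equals the softmax probability of "Yes" when only the two
    answer tokens are considered.
    """
    diff = logit_yes - logit_no
    return 1.0 / (1.0 + math.exp(-diff))

# Hypothetical logits for the tokens "Yes" and "No" after the suffix
# "Is the above response helpful? Please answer Yes or No".
score = suffix_score(logit_yes=2.0, logit_no=0.5)
```

Because a score of this kind is a differentiable function of the hidden states, its gradient with respect to those states can be obtained by backpropagation, with no labeled data involved.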
This, by the way, ties into the topic of "self-improvement/self-judgment", which has garnered much interest.
Based on this idea, we built an iterative framework: 1) define the control direction by selecting a suffix string and target (step 2 in the figure); 2) branch the first token and sample the response with the highest suffix score at each iteration (step 1/4 in the figure); and 3) obtain gradients based on that response, find a proper step size for the gradients, and then control the model (add the gradients to the hidden states at the positions of the input tokens; step 3 in the figure).
Steps 3 and 4 form the iteration loop. The optimization objective is thus to maximize the suffix score shown below:
where H_{input} denotes the input hidden states with the suffix gradients added. Specifically, we use the original (uncontrolled) model for suffix score evaluation.
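The gradient step and step-size search can be sketched on a toy problem. In the sketch below the "model" is replaced by a fixed linear readout followed by a sigmoid, so the gradient is available in closed form; the names `score`, `score_grad`, and `control`, the halving search, and all numbers are assumptions for illustration. Response sampling and first-token branching are left out.

```python
import math

def score(h, w):
    """Toy suffix score: sigmoid of a fixed linear readout of the hidden state."""
    z = sum(hi * wi for hi, wi in zip(h, w))
    return 1.0 / (1.0 + math.exp(-z))

def score_grad(h, w):
    """Analytic gradient of the toy score with respect to h."""
    s = score(h, w)
    return [s * (1.0 - s) * wi for wi in w]

def control(h, w, iterations=3, init_step=8.0, max_halvings=10):
    """Gradient ascent on h, halving the step size until the score improves."""
    h = list(h)
    for _ in range(iterations):
        g = score_grad(h, w)
        base = score(h, w)
        step = init_step
        for _ in range(max_halvings):
            candidate = [hi + step * gi for hi, gi in zip(h, g)]
            if score(candidate, w) > base:
                h = candidate  # accept the first step size that helps
                break
            step /= 2.0
    return h

w = [0.5, -1.0, 0.25]   # hypothetical frozen readout (stands in for the judge)
h0 = [0.0, 1.0, 0.0]    # hypothetical input hidden state
h_controlled = control(h0, w)
```

In this toy setting any positive step raises the score, so the search accepts the first candidate. With a real LLM the score is not monotone along the gradient direction, which is why a step-size search is worth having.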
We were also interested in compressing these found gradients into another module. This is similar to the idea of LoRRA in the RepE paper[2] and to a parallel work, although we were more interested in learning a prefix. By gathering suffix gradients obtained from the first part, we trained a Prefix Controller by minimizing the mean squared error between the hidden states (latent representations) from the corresponding layers.
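To make the mean-squared-error objective concrete, here is a heavily simplified sketch. It assumes the controller is a single shared offset vector added to every hidden state, which is much cruder than a learned prefix acting through attention, and it fits that offset so that the shifted hidden states match the gradient-controlled targets. The name `train_offset` and all data are hypothetical.

```python
def train_offset(hidden, targets, lr=0.1, epochs=200):
    """Fit one shared offset p by gradient descent on the MSE between
    (hidden + p) and the gradient-controlled target hidden states."""
    dim = len(hidden[0])
    n = len(hidden)
    p = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for h, t in zip(hidden, targets):
            for j in range(dim):
                # derivative of the squared error, averaged over examples
                grad[j] += 2.0 * (h[j] + p[j] - t[j]) / n
        p = [pj - lr * gj for pj, gj in zip(p, grad)]
    return p

# Hypothetical hidden states and the suffix gradients found for each of them.
hidden = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
suffix_grads = [[0.2, -0.1], [0.4, 0.1], [0.3, 0.0]]
targets = [[h + g for h, g in zip(hs, gs)] for hs, gs in zip(hidden, suffix_grads)]

p = train_offset(hidden, targets)
```

In this toy case the fitted offset converges to the average of the per-example suffix gradients, which gives one intuition for "compressing" many gradients into a single reusable module.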
To ensure the quality of the training data (suffix gradients), we filtered them by their norms and by the suffix score of the output when the model is controlled with those gradients.
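A filter of this kind might look like the sketch below. The record layout (`grad`, `score`), the thresholds, and the direction of each comparison (drop large-norm gradients, keep high-scoring outputs) are assumptions; the text does not specify them.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(v * v for v in vec))

def filter_samples(samples, max_norm, min_score):
    """Keep samples whose gradient norm is small enough and whose
    controlled output reached a high enough suffix score."""
    return [
        s for s in samples
        if l2_norm(s["grad"]) <= max_norm and s["score"] >= min_score
    ]

# Hypothetical training candidates: a suffix gradient plus the suffix
# score of the output generated under that gradient.
samples = [
    {"grad": [0.3, 0.4], "score": 0.90},   # norm 0.5, high score
    {"grad": [3.0, 4.0], "score": 0.95},   # norm 5.0, too large
    {"grad": [0.1, 0.0], "score": 0.20},   # small norm, low score
]
kept = filter_samples(samples, max_norm=1.0, min_score=0.5)
```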
Below are some of the results. SelfControl achieves good performance on various tasks. It can also serve as a data synthesis method (see the DPO experiment):
We also carried out some exploratory studies on suffix gradients, and we are especially interested in how the norms of the gradients pattern across different tasks:
Overall, our experiments and analysis show that SelfControl is able to control a wide range of model behaviors, and can potentially be applied to other areas, including alignment (the DPO experiment) and mechanistic interpreta...