How to Live in a High-Promised World

Activation additions have some alignment promise, but we remain highly uncertain about its magnitude. They may let you change model properties that are inaccessible to the fine-tuning process. If we optimize a model to increase logits on nice-seeming tokens, the model might simply memorize nice token outputs in that situation, because doing so locally reduces loss, rather than becoming nicer in general. Similarly, perhaps activation additions can help us roughly locate niceness circuits. Knowing the relevant layer number or numbers could halve the search space several times over.
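To make the idea concrete, here is a minimal sketch of an activation addition on a toy stack of layers. It is an illustration under stated assumptions, not the method used in the episode: the three toy layers, the function names `forward` and `steering_vector`, and the choice to build the steering vector as a difference of activations from two contrasting inputs are all hypothetical. The point it shows is that the weights stay fixed and only the activations entering one chosen layer are shifted.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layers() -> List[Layer]:
    # Hypothetical stand-in for a model: each layer maps a vector to a vector.
    return [
        lambda h: [2.0 * x for x in h],      # layer 0: scale
        lambda h: [x + 1.0 for x in h],      # layer 1: shift
        lambda h: [max(0.0, x) for x in h],  # layer 2: ReLU
    ]


def forward(
    layers: List[Layer],
    x: Vector,
    inject_layer: Optional[int] = None,
    steering: Optional[Vector] = None,
    coeff: float = 1.0,
) -> Tuple[Vector, List[Vector]]:
    """Run the toy model, optionally adding coeff * steering to the
    activations entering `inject_layer`. Returns the output and the
    list of activations that entered each layer."""
    h = list(x)
    layer_inputs: List[Vector] = []
    for i, layer in enumerate(layers):
        if steering is not None and i == inject_layer:
            h = [a + coeff * s for a, s in zip(h, steering)]
        layer_inputs.append(list(h))
        h = layer(h)
    return h, layer_inputs


def steering_vector(
    layers: List[Layer], x_pos: Vector, x_neg: Vector, layer_idx: int
) -> Vector:
    """Assumed construction: difference between the activations entering
    `layer_idx` for a 'positive' and a 'negative' input."""
    _, acts_pos = forward(layers, x_pos)
    _, acts_neg = forward(layers, x_neg)
    return [p - n for p, n in zip(acts_pos[layer_idx], acts_neg[layer_idx])]


layers = make_toy_layers()
vec = steering_vector(layers, x_pos=[1.0, 0.0], x_neg=[0.0, 1.0], layer_idx=1)
baseline, _ = forward(layers, [0.5, 0.5])
steered, _ = forward(layers, [0.5, 0.5], inject_layer=1, steering=vec)
print(vec, baseline, steered)
```

In this sketch, sweeping `inject_layer` over the layer indices and comparing the steered output with the baseline is one way to see which layers matter most for a given property, which is the sense in which a layer number narrows the search.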
