"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.

LessWrong (Curated & Popular)

How to Find an Activation Addition That Leads to Improbable Completions

This section demonstrates our activation addition technique with several striking additions. We also show some examples that we thought might work but didn't. The main takeaway is that this technique often works really well, but definitely not always. Some completions contain unpleasant content, including gendered slurs.
