How to Interpret GPT-2 XL Completions as Weddings

We feel confused about how to interpret these data, but we'll take a stab at it anyway and lay out one highly speculative hypothesis. Suppose there's a "wedding" feature direction in the residual stream activations just before layer 6. If GPT-2 XL represents features in a non-axis-aligned basis, then we'd expect this vector to almost certainly have nonzero components in all 1600 residual stream dimensions. What we've observed is that adding certain activation vectors reliably produces completions that appear to us to be more about weddings, and those completions remain coherent.
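
To make the hypothesis above concrete, here is a toy NumPy sketch. It is not the actual experiment: a random unit vector stands in for the hypothesized wedding direction, the "activations" are random placeholders rather than real GPT-2 XL outputs, and the names `wedding_direction`, `add_steering_vector`, and `coeff` are hypothetical. It only illustrates two points: a direction that is not aligned with any axis has nonzero components in all 1600 dimensions, and adding a multiple of that direction shifts every position's projection onto it by the same amount.

```python
import numpy as np

D_MODEL = 1600  # width of the GPT-2 XL residual stream

rng = np.random.default_rng(0)

# ASSUMPTION: a random unit vector stands in for the hypothesized
# "wedding" feature direction. The real direction, if it exists, is unknown.
wedding_direction = rng.standard_normal(D_MODEL)
wedding_direction /= np.linalg.norm(wedding_direction)

# A generic (non-axis-aligned) direction has a nonzero component
# in every residual stream dimension.
nonzero_dims = int(np.count_nonzero(wedding_direction))

# ASSUMPTION: random placeholder activations for a 5-token prompt,
# standing in for the residual stream just before some layer.
resid = rng.standard_normal((5, D_MODEL))


def add_steering_vector(resid, direction, coeff):
    """Add coeff * direction to the activations at every token position."""
    return resid + coeff * direction


steered = add_steering_vector(resid, wedding_direction, coeff=4.0)

# Because the direction has unit norm, the projection onto it
# increases by exactly coeff at each position.
shift = steered @ wedding_direction - resid @ wedding_direction

print(nonzero_dims)
print(shift)
```

In this toy setting the intervention moves the activations along a single direction and leaves every orthogonal component unchanged. That is one possible reading of why steered completions could change topic while staying coherent, though the sketch does not demonstrate anything about the real model.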
