Is This a Causal Tracing Problem?

The researchers used a technique called causal tracing. They recorded all of these hidden activations, then they run the model again with corrupted input. Instead of at this particular hidden state, instead of what the model gets as an input, you know, it just ignores that particular hidden state and replaces it with the one from the clean input. And now we observe, so here, maybe it said like Paris before, because something is in downtown, the model just said Paris. However, if copying over that hidden state from the clean signal actually changes the output back from Paris to Seattle,. Well, that is a fat marker. Oh, sorry about that. Those are my notes. If that actually changes

Play episode from 10:02

Transcript

Episode notes

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app