AXRP - the AI X-risk Research Podcast cover image

19 - Mechanistic Interpretability with Neel Nanda

AXRP - the AI X-risk Research Podcast

00:00

Activation Patching Is a Great Way to Find Out What a Neuron Does

activation patching is a way to flip the answer from Rome to Paris. It turns out that most things don't matter some things matter a ton and that just patching in a single activation can often be enough to like significantly flip things. There are also techniques that are much more kind of suggestive where it gives some evidence i think it should be part of a talket you use with heavy caution one example of this is a really dumb technique figuring out what a neuron does  you look at its max activating data set examples yep they're all recipes.

Transcript
Play full episode

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!
App store bannerPlay store banner
Get the app