
19 - Mechanistic Interpretability with Neel Nanda

AXRP - the AI X-risk Research Podcast

CHAPTER

Activation Patching Is a Great Way to Find Out What a Neuron Does

Activation patching is a way to flip the answer from Rome to Paris. It turns out that most things don't matter, some things matter a ton, and just patching in a single activation can often be enough to significantly flip things. There are also techniques that are much more suggestive, where they give some evidence; I think they should be part of a toolkit you use with heavy caution. One example of this is a really dumb technique for figuring out what a neuron does: you look at its max activating dataset examples, and yep, they're all recipes.
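For readers who want to try the single-activation patching described above, here is a minimal sketch using the TransformerLens library. The model choice (gpt2-small), the Eiffel Tower / Colosseum prompt pair, and patching the residual stream at the final token position are illustrative assumptions, not details taken from the episode.

```python
import torch
from transformer_lens import HookedTransformer

# Load a small model; any HookedTransformer model works the same way.
model = HookedTransformer.from_pretrained("gpt2-small")

# Assumed prompt pair: the clean run should answer " Paris",
# the corrupted run should answer " Rome".
clean_prompt = "The Eiffel Tower is in"
corrupted_prompt = "The Colosseum is in"

clean_tokens = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)

# Cache every activation from the clean run so we can patch from it later.
_, clean_cache = model.run_with_cache(clean_tokens)

paris = model.to_single_token(" Paris")
rome = model.to_single_token(" Rome")

def paris_vs_rome(logits: torch.Tensor) -> float:
    """Logit difference (Paris minus Rome) at the final position."""
    return (logits[0, -1, paris] - logits[0, -1, rome]).item()

def patch_final_resid(resid, hook):
    """Overwrite the residual stream at the final token with the clean activation."""
    resid[:, -1, :] = clean_cache[hook.name][:, -1, :]
    return resid

# Patch a single activation (one layer, final position) at a time and see
# how much of the clean "Paris" answer it restores in the corrupted run.
for layer in range(model.cfg.n_layers):
    hook_name = f"blocks.{layer}.hook_resid_pre"
    patched_logits = model.run_with_hooks(
        corrupted_tokens,
        fwd_hooks=[(hook_name, patch_final_resid)],
    )
    print(f"layer {layer}: Paris-minus-Rome logit diff = {paris_vs_rome(patched_logits):.3f}")
```

Layers where this single patch pushes the logit difference sharply toward Paris are the ones carrying the relevant information, which is the "most things don't matter, some things matter a ton" pattern from the quote.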
