
19 - Mechanistic Interpretability with Neel Nanda
AXRP - the AI X-risk Research Podcast
Reverse Engineering and Networks
The paper basically says, okay, we're just going to multiply them and think of this as a function with 50,000 inputs and 50,000 outputs. This is kind of dangerous reasoning, especially if you're worried about a system that is adversarially trying to defeat our tools. But my prediction is just that that isn't a thing that matters that much, at least for the kind of network problems we're dealing with in a minute. It's learning a matrix that, once fed through a softmax, will be a good approximation to a bigram table. Yeah, where it's not really important to the underlying computation, rather than things that are kind of key to it. The model is not learning a bigram table, it's not learning
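To make the "multiply them" step concrete, here is a minimal sketch in plain Python. It assumes that "them" refers to a token embedding matrix and an unembedding matrix, which the excerpt does not state outright. The names `W_E`, `W_U`, `logit_table`, and `approx_bigram` are hypothetical, the weights are random rather than trained, and the vocabulary is shrunk from roughly 50,000 tokens to 6 so the table is easy to inspect.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
VOCAB_SIZE, D_MODEL = 6, 2  # toy sizes (assumption); the episode mentions ~50,000 tokens

# Hypothetical, untrained weights standing in for a model's learned matrices.
W_E = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(VOCAB_SIZE)]  # embed
W_U = [[random.gauss(0, 1) for _ in range(VOCAB_SIZE)] for _ in range(D_MODEL)]  # unembed

# Multiplying the two gives one VOCAB_SIZE x VOCAB_SIZE table of logits:
# row i holds a score for every possible next token given current token i.
logit_table = matmul(W_E, W_U)

# A row-wise softmax turns each row into a next-token distribution,
# which is the sense in which the product can approximate a bigram table.
approx_bigram = [softmax(row) for row in logit_table]
```

Indexing `approx_bigram[i][j]` reads off the sketch's probability of token `j` following token `i`. Because the product passes through only `D_MODEL` dimensions, the table is low rank, so it can only approximate a true bigram table rather than store one entry by entry.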