
19 - Mechanistic Interpretability with Neel Nanda

AXRP - the AI X-risk Research Podcast

CHAPTER

Reverse Engineering and Networks

The paper basically says, okay, we're just going to multiply them and think of this as a 50,000-input, 50,000-output function. This is kind of dangerous reasoning, especially if you're worried about a system that is adversarially trying to defeat our tools. But my prediction is just that that isn't a thing that matters that much, at least for the kind of networks we're dealing with at the minute. It's learning a matrix that, once fed through a softmax, will be a good approximation to a bigram table. Yeah, where they're not really important to the underlying computation so much as they are things that are kind of key to it. The model is not learning a bigram table, it's not learning…
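The idea in the excerpt, multiplying the embedding and unembedding matrices and softmaxing the result to get an implied bigram table, can be sketched in a few lines of NumPy. This is an illustrative sketch only: the matrix names `W_E` and `W_U` follow common transformer-circuits notation, the weights are random, and the vocabulary is shrunk from ~50,000 to 50 so the example runs instantly.

```python
import numpy as np

# Illustrative sketch (not a real model): the "direct path" of a
# transformer, ignoring attention and MLPs, is embed -> unembed.
# Multiplying the two matrices gives a vocab x vocab logit table;
# softmaxing each row yields the bigram-table approximation
# discussed in the excerpt above.
rng = np.random.default_rng(0)
d_vocab, d_model = 50, 8  # real models: ~50,000 vocab, much larger d_model

W_E = rng.normal(size=(d_vocab, d_model))  # embedding matrix (assumed shape)
W_U = rng.normal(size=(d_model, d_vocab))  # unembedding matrix (assumed shape)

direct_path = W_E @ W_U  # (d_vocab, d_vocab): one logit per (token, next token)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Row t is an implied distribution P(next token | current token t).
bigram_probs = softmax(direct_path)
```

With trained weights rather than random ones, each row of `bigram_probs` would approximate the bigram statistics of the training corpus, which is the point being made in the excerpt.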

