
19 - Mechanistic Interpretability with Neel Nanda
AXRP - the AI X-risk Research Podcast
Indirect Object Identification
In the indirect object identification paper, they found an interesting phenomenon: there were name mover heads, which attended to the correct answer, and negative name mover heads, which I think also attended to the correct name but suppressed it. When you ablated the name mover heads, some of the negative name mover heads kind of acted as backups and significantly reduced their negative behavior. My guess is that this was a result of dropout, which GPT-2 was trained with.
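
The compensation effect described above can be illustrated with a toy calculation. This is only a sketch and not GPT-2 or the paper's method: the function name `logit_diff`, the two-head setup, and every number are made up for illustration, and the rule that the negative head's suppression scales with the name mover's output is an assumption chosen to mimic the described behavior.

```python
# Toy model, not GPT-2: all names and numbers here are hypothetical.
def logit_diff(mover_on: bool = True) -> dict:
    """Per-head contributions to logit(correct name) - logit(other name)."""
    # The name mover head attends to the correct name and boosts its logit.
    # Ablation is modelled by zeroing this contribution.
    mover = 3.0 if mover_on else 0.0
    # The negative name mover head also attends to the correct name but pushes
    # its logit down. Assumption: it suppresses more when the name mover's
    # signal is stronger, plus a small constant amount.
    negative = -0.5 * mover - 0.2
    return {"mover": mover, "negative": negative, "total": mover + negative}


clean = logit_diff(mover_on=True)
ablated = logit_diff(mover_on=False)

# Direct effect: what the name mover head itself contributes.
direct_effect = clean["mover"]
# Total effect: how much the final logit difference actually drops on ablation.
total_effect = clean["total"] - ablated["total"]

print("direct effect of name mover:", direct_effect)
print("total effect of ablating it:", total_effect)
print("negative head before/after:", clean["negative"], ablated["negative"])
```

In this toy, the drop in the final logit difference is smaller than the name mover head's direct contribution, because the negative head suppresses less once the name mover is ablated. That gap between direct and total effect is the kind of signature the backup behavior would leave.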