19 - Mechanistic Interpretability with Neel Nanda

AXRP - the AI X-risk Research Podcast

CHAPTER

Indirect Object Identification

In the indirect object identification paper, they found an interesting phenomenon: there were name mover heads, which attended to the correct answer, and negative name mover heads, which I think also attended to the correct name but suppressed it. When you ablated the name mover heads, some of the negative name mover heads acted as backups and significantly reduced their negative behavior. My guess is that this was a result of dropout, which GPT-2 was trained with.
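As a rough illustration of why this kind of compensation matters when reading ablation results, here is a minimal toy sketch in Python. It is not the circuit from the paper: the two-dimensional residual stream, the head strengths, the `suppression` factor, and the rule that the negative head pushes against whatever the name mover wrote are all assumptions made up for illustration, and the sketch says nothing about dropout.

```python
import numpy as np

def logit_diff(head_outputs, direction):
    """Project the summed head outputs onto the (correct - incorrect) name direction."""
    return float(np.sum(head_outputs, axis=0) @ direction)

def run_toy_model(ablate_name_mover=False, suppression=0.5):
    # Hypothetical unit direction in the residual stream favoring the correct name.
    direction = np.array([1.0, 0.0])
    # Toy name mover head: writes along the correct-name direction.
    # Zero-ablation replaces its output with zeros.
    name_mover = np.zeros(2) if ablate_name_mover else 4.0 * direction
    # Toy negative head (assumed rule): pushes against whatever the name mover
    # wrote, so its suppression shrinks when the name mover is ablated.
    negative = -suppression * name_mover
    # Contribution from the rest of the toy model, unaffected by the ablation.
    rest = 1.0 * direction
    return logit_diff(np.stack([name_mover, negative, rest]), direction)

clean = run_toy_model()
ablated = run_toy_model(ablate_name_mover=True)
print("clean logit diff:", clean)
print("ablated logit diff:", ablated)
print("drop from ablation:", clean - ablated)
```

In this toy setup the name mover writes 4.0 directly toward the correct name, but ablating it lowers the logit difference by only 2.0, because the negative head stops suppressing at the same time. The measured effect of an ablation can therefore understate how much a head contributes directly.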
