19 - Mechanistic Interpretability with Neel Nanda

AXRP - the AI X-risk Research Podcast

Indirect Object Identification

In the indirect object identification paper, they found an interesting phenomenon: there were name mover heads, which attended to the correct answer, and negative name mover heads, which I think also attended to the correct name but suppressed it. When you ablated the name mover heads, some of the negative name mover heads acted as backups and significantly reduced their negative behavior. My guess is that this was a result of dropout, which GPT-2 was trained with.
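
To make the backup effect concrete, here is a minimal toy sketch in Python. It is not the paper's code or a real model: the two constants, the `run_toy_model` function, and the rule that the negative head suppresses a fixed fraction of whatever the name mover head wrote are all assumptions made up for illustration, and zero-ablation is assumed. It only shows why ablating a head can change the output less than that head's direct contribution would suggest; it does not model dropout.

```python
# Toy, hand-built model: every number below is hypothetical.
NAME_MOVER_WRITE = 3.0  # assumed logit boost the name mover head gives the correct name
NEGATIVE_GAIN = 0.5     # assumed fraction of that signal the negative head suppresses


def run_toy_model(ablate_name_mover: bool = False) -> dict[str, float]:
    """Return each head's contribution to the correct-name logit."""
    # Zero-ablation: the ablated head writes nothing.
    name_mover = 0.0 if ablate_name_mover else NAME_MOVER_WRITE
    # Assumption: the negative head pushes against the signal it sees,
    # so it has less to suppress once the name mover head is removed.
    negative = -NEGATIVE_GAIN * name_mover
    return {
        "name_mover": name_mover,
        "negative": negative,
        "total": name_mover + negative,
    }


clean = run_toy_model()
ablated = run_toy_model(ablate_name_mover=True)

# Direct effect: what the name mover head itself wrote to the logit.
direct_effect = clean["name_mover"]
# Total effect: how much the final logit actually drops after ablation.
total_effect = clean["total"] - ablated["total"]

print(f"direct effect of name mover head: {direct_effect}")
print(f"total effect of ablating it:      {total_effect}")
```

In this toy setup the total effect is smaller than the direct effect because the negative head's suppression shrinks once the name mover head is ablated, which is the compensation pattern described above.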
