19 - Mechanistic Interpretability with Neel Nanda

AXRP - the AI X-risk Research Podcast

CHAPTER

Is Its Output Not Interpretable?

If the model says, "Let's go invade Germany," and it's lying, then if you were just doing some black-box analysis, it might be really hard to get any traction, because the model is lying. If you're aiming for the very ambitious goal of actually understanding the cognition behind what it said, that's a very different question. I'm just kind of unconvinced there is such a thing as an uninterpretable output, because there should always be a reason.
