
19 - Mechanistic Interpretability with Neel Nanda
AXRP - the AI X-risk Research Podcast
Is Its Output Not Interpretable?
If the model says, "Let's go invade Germany," and it's lying, then if you were just doing some black-box analysis it might be really hard to get any traction, because the model is lying. If you're aiming for the very ambitious goal of actually understanding the cognition behind what it said, then it's a very different question. I'm just kind of unconvinced there is such a thing as an uninterpretable output, because there should always be a reason.