
19 - Mechanistic Interpretability with Neel Nanda
AXRP - the AI X-risk Research Podcast
Is Its Output Not Interpretable?
If the model says "let's go invade Germany" and it's lying, then if you were just doing some black-box analysis it might be really hard to get any traction here, because the model is lying. If you are aiming for the very ambitious goal of actually understanding the cognition behind what it said, then it's a very different question. I'm just kind of unconvinced there is such a thing as an uninterpretable output, because there should always be a reason.