
Neel Nanda on Avoiding an AI Catastrophe with Mechanistic Interpretability
Future of Life Institute Podcast
Could Mechanistic Interpretability Research Help Future Language Models?
I feel like making them better is vastly outstripping our understanding of them, and I want the relative speed to favor the safety side as much as possible. So what I could be worried about is this: say you do some mechanistic interpretability research, and then it's published online or written up in a blog post. Could the research we're doing now help future language models deceive us, because they understand how we're trying to interpret them? Yes, there are totally worlds where that happens.