Could Mechanistic Interpretability Research Help Future Language Models?

I feel like making them better is vastly outstripping our understanding of them. And I want the relative speed to be as great as possible on the safer side. So what I could be worried about is this: say you do some mechanistic interpretability research, and then it's published online or written up in a blog post. Could the research being done now help future language models deceive us, because they understand how we're trying to interpret them? Yes, there are totally worlds where that happens.

