
Neel Nanda on Avoiding an AI Catastrophe with Mechanistic Interpretability
Future of Life Institute Podcast
Can Future Language Models Deceive Us?
Could the research being done now help future language models deceive us, because a model understands how we are trying to interpret it? Neel: This is not on my list of things I'm concerned about in the short term, or even high on my list of ways I think my research could be harmful, but it definitely could happen. He says that having techniques so robust they can't be broken is much, much harder.