
Neel Nanda on Avoiding an AI Catastrophe with Mechanistic Interpretability

Future of Life Institute Podcast

CHAPTER

Can Future Language Models Deceive Us?

Neel: Could the research being done now help future language models deceive us, because a model understands how we're trying to interpret it? Neel: This is not on my list of things I'm concerned about in the short term, or even high on my list of ways I think my research could be harmful, but it definitely could happen. He says that making techniques so robust they can't be broken is much, much harder.

