19 - Mechanistic Interpretability with Neel Nanda

AXRP - the AI X-risk Research Podcast

CHAPTER

AI Learning How to Do What Are You Doing?

"Even if the model was outputting seemingly random gibberish, I can say things like, hmm, it says flurgle rather than blurbel," he said. "It has some internal representation of what the person who is talking to you wants and how to manipulate the person that's talking to you to achieve this." "Yeah, yeah, slightly. That paper was terrifying."

