19 - Mechanistic Interpretability with Neel Nanda

AXRP - the AI X-risk Research Podcast

AI Learning How to Do What Are You Doing?

"Even if the model was outputting seemingly random gibberish, I can say things like, 'Hmm, it says flurgle rather than blurbel,'" he said. "It has some internal representation of what the person talking to you wants and how to manipulate the person that's talking to you to achieve this." "Yeah, yeah, slightly. That paper was terrifying."
