Can we tell if an AI is loyal by reading its mind? DeepMind's Neel Nanda (part 1)

80,000 Hours Podcast

Navigating AI Model Interpretations

This chapter explores mechanistic interpretability ("mech interp") and other frameworks for understanding AI models, highlighting the tension between detailed internal analyses and broader behavioral insights. It also discusses the evolving nature of AI, the potential risks posed by advanced models, and the critical need for effective safety measures in AI research.
