
Can we tell if an AI is loyal by reading its mind? DeepMind's Neel Nanda (part 1)

80,000 Hours Podcast


Understanding AI Decision-Making: The Concept of Refusal

This chapter explores a research finding about how AI models decide to refuse questions. It explains how refusal can be understood as a linear representation, and why it matters to interpret a model's activations as points in a multi-dimensional space.
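As a rough illustration of what a "linear representation" could look like, the sketch below treats each activation as a vector and estimates a single direction that separates refused prompts from answered ones. The difference-of-means estimate, the helper names (`refusal_direction`, `projection`), and the three-dimensional toy vectors are all assumptions made for this example; none of them come from the episode, and real model activations have thousands of dimensions.

```python
# Toy sketch only: the vectors below are invented, not taken from any model.

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def refusal_direction(refused, answered):
    """Unit vector from the mean 'answered' activation to the mean 'refused' one.

    Assumes the two means differ; otherwise the norm is zero and this fails.
    """
    diff = [r - a for r, a in zip(mean_vector(refused), mean_vector(answered))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]


def projection(activation, direction):
    """How far an activation lies along the given direction."""
    return dot(activation, direction)


# Hypothetical activations for prompts the model refused vs. answered.
refused_acts = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.1]]
answered_acts = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.1]]

direction = refusal_direction(refused_acts, answered_acts)

for label, acts in (("refused", refused_acts), ("answered", answered_acts)):
    for act in acts:
        print(label, round(projection(act, direction), 3))
```

In this toy setup, refused prompts project much further along the direction than answered ones, which is the sense in which a behaviour can be "read off" one axis of a multi-dimensional space.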
