
"Discussion with Nate Soares on a key alignment difficulty" by Holden Karnofsky

LessWrong (Curated & Popular)


The Dangers of Mechanistic AI Training

Anthropic-style mechanistic interpretability: the AI forms and/or upweights many internal circuits doing things like "reflect on your goal and modify it", circuits that basically match or promote a particular pattern. This feels kind of unrealistic for the kind of pre-training that's common today, but so does actually learning how to do needle-moving alignment research just from next-token prediction. If we condition on the latter, it seems kind of reasonable to imagine that there must be cases where an AI has to be able to do needle-moving alignment research in order to improve its predictions.
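
To make the "just from next-token prediction" framing concrete, here is a minimal sketch (assuming PyTorch; the model and variable names are illustrative, not anything from the episode) of the pretraining objective being discussed: the only training signal is cross-entropy loss on the next token, so whatever internal computation reduces that loss is what gets upweighted.

```python
import torch
import torch.nn.functional as F

# Toy next-token-prediction step (hypothetical setup for illustration).
# The model maps token ids to logits over the vocabulary; gradient descent
# upweights whichever internal computations improved the next-token guess.
vocab_size = 100
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 32),
    torch.nn.Linear(32, vocab_size),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

tokens = torch.randint(0, vocab_size, (1, 16))   # one toy sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from tokens up to t

logits = model(inputs)                                  # (1, 15, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()   # credit flows to whatever reduced prediction error
optimizer.step()  # ...and that computation gets upweighted
```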
