
"Discussion with Nate Soares on a key alignment difficulty" by Holden Karnofsky
LessWrong (Curated & Popular)
The Dangers of Mechanistic AI Training
(Anthropic-style mechanistic interpretability.) The AI forms and/or upweights many [circuits?] doing things like "reflect on your goal and modify it", ones that basically match or promote a particular pattern. This feels kind of unrealistic for the kind of pre-training that's common today, but so does actually learning how to do needle-moving alignment research just from next-token prediction. If we condition on the latter, it seems kind of reasonable to imagine that there must be cases where an AI has to be able to do needle-moving alignment research in order to improve its predictions.


