
"Discussion with Nate Soares on a key alignment difficulty" by Holden Karnofsky
LessWrong (Curated & Popular)
The Dangers of Needle-Moving Alignment Research
The path to doing needle-moving alignment research, or other useful work, runs through CIS. You could train a shallow imitation learner to analyze a thousand neurons in much the same way that Anthropic's team has analyzed 10. But needle-moving research requires taking a step beyond that and having insights well outside the training distribution. This is needed for sufficient usefulness, and it is what has to be reached through CIS. The high-level premises that imply danger here follow. I believe both of these have to go through in order for the hypothesized training process to be dangerous in the way Nate is pointing at.


