"Discussion with Nate Soares on a key alignment difficulty" by Holden Karnofsky

LessWrong (Curated & Popular)

The Dangers of Needle-Moving Alignment Research

The path to doing needle-moving alignment research or other useful stuff runs through CIS. You could train a shallow imitation learner to analyze a thousand neurons, similarly to how the Anthropic team has analyzed ten. But needle-moving research requires being able to take a step after that and have insights well outside of the training distribution. This is needed for sufficient usefulness, and it is what needs to be reached through CIS. The high-level premises that imply danger here follow; I believe both of them have to go through in order for the hypothesized training process to be dangerous in the way Nate is pointing at.
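
To make the "shallow imitation learner" framing concrete, here is a minimal, hypothetical sketch in PyTorch: a small model is fit on (activation features, human analysis label) pairs, then probed on inputs shifted well outside its training distribution. Every detail here, including the feature and label shapes and the synthetic data, is an illustrative assumption rather than anything described in the episode.

```python
# Hypothetical sketch: imitate human-written neuron analyses with a shallow
# supervised model, then probe it outside the training distribution.
import torch
import torch.nn as nn

torch.manual_seed(0)

N_TRAIN, N_FEATURES, N_LABELS = 1000, 64, 8  # e.g. 1000 analyzed neurons

# Synthetic stand-ins for activation statistics and human analysis labels.
X_train = torch.randn(N_TRAIN, N_FEATURES)
y_train = torch.randint(0, N_LABELS, (N_TRAIN,))

model = nn.Sequential(
    nn.Linear(N_FEATURES, 128),
    nn.ReLU(),
    nn.Linear(128, N_LABELS),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Plain supervised imitation of the existing human analyses.
for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    opt.step()

# Probe on inputs far outside the training distribution (shifted and scaled).
with torch.no_grad():
    X_ood = torch.randn(200, N_FEATURES) * 5.0 + 3.0
    probs = model(X_ood).softmax(dim=-1)
    print("mean max-confidence on OOD inputs:", probs.max(dim=-1).values.mean().item())
```

The out-of-distribution probe is only meant to illustrate the gap the speakers are pointing at: a learner trained to imitate existing analyses can be confidently wrong, or simply uninformative, on exactly the kinds of inputs where genuinely new insights would be needed.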
