The Importance of Mechanistic Interpretability in AI
I think the people who are most worried are predicting that all the subhuman-level AI models are gonna seem alignable, right? They're gonna seem aligned, but they're gonna be deceiving us in some way. It would be beside the point to ask, like, how could it go differently than anyone anticipated? So on this in particular, what information would be relevant? How informative would the difficulty of aligning Claude 3 and the next generation of models basically be? Like, is that a big piece of information, or is that not gonna be big?