Alignment and Mechanistic Interpretability
"We should never train for interpretability because that's taking away that advantage," he says. " Mechanistic interpretability is the only thing that even in principle, and we're nowhere near there yet." We need to get into a dynamic where we have an extended test set, an extended training set, which is all these alignment methods, and an extended testSet of what worked and what didn't.