The Importance of Human Annotation in Natural Language
The way I want to represent the meaning of a hidden representation, or even of a single hidden feature, is as the set of model inputs that cause that feature or representation to activate. We have a paper up on arXiv, by me and Jesse Mu, who's a grad student at Stanford, showing that you can assign not just single concept labels to neurons but relatively complex logical structures. We're still at an intermediate stage right now, trying to find not just logical representations of these neurons but to actually go through a network and ask: can we explain, unit by unit and in natural language, what every neuron is doing, automatically? It's an empirical question to what extent that's true of the larger-scale models that we're starting
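To make the idea of "a neuron's meaning as the set of inputs that activate it" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method from the paper mentioned above: the toy activations, the concept names (`water`, `blue`, `boat`), the threshold, the intersection-over-union score, and the exhaustive search over pairwise formulas are all hypothetical choices made for this example.

```python
from itertools import combinations


def active_inputs(activations, threshold):
    """Return the set of input indices whose activation exceeds the threshold."""
    return {i for i, a in enumerate(activations) if a > threshold}


def iou(a, b):
    """Intersection over union of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_explanation(neuron_set, concepts, universe):
    """Score single concepts, negations, and pairwise AND / OR formulas.

    Assumption: IoU against the neuron's active-input set is used as the
    score; the talk does not specify a scoring function.
    """
    candidates = {}
    for name, members in concepts.items():
        candidates[name] = members
        candidates[f"NOT {name}"] = universe - members
    for (n1, s1), (n2, s2) in combinations(concepts.items(), 2):
        candidates[f"{n1} AND {n2}"] = s1 & s2
        candidates[f"{n1} OR {n2}"] = s1 | s2
    scored = {formula: iou(neuron_set, s) for formula, s in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Hypothetical toy data: 8 inputs, one neuron's activation on each input.
activations = [0.1, 0.2, 0.9, 0.8, 0.1, 0.0, 0.05, 0.3]
universe = set(range(len(activations)))

# Hypothetical concept annotations: which inputs each concept applies to.
concepts = {
    "water": {0, 1, 2, 3},
    "blue": {2, 3, 4, 5},
    "boat": {6, 7},
}

neuron_set = active_inputs(activations, threshold=0.5)
formula, score = best_explanation(neuron_set, concepts, universe)
print(neuron_set, formula, score)
```

In this toy setup the neuron fires only on inputs 2 and 3, which are exactly the inputs labeled both `water` and `blue`, so the conjunction describes the neuron better than either single label. That is the sense in which a logical structure can say more than a single concept label.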