
[11] Jacob Andreas - Learning from Language
The Thesis Review
The Importance of Human Annotation in Natural Language
The way I want to represent the meaning of a hidden representation, or even of a single hidden feature, is as a set of model inputs that cause that feature or that representation to activate. We have a paper up on arXiv, by me and Jesse Mu, a grad student at Stanford, showing that you can assign not just single-concept labels to neurons but relatively complex logical structures. We're still at an intermediate stage right now, trying to go beyond logical representations of these neurons and actually go through a network and ask: can we explain, unit by unit, in natural language, what every neuron is doing, automatically? It's an empirical question to what extent that's true of the larger-scale models that we're starting…
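The core idea described here, characterizing a hidden unit by the set of inputs that activate it, can be sketched in a few lines of PyTorch. This is a minimal illustration, not the method from the paper: the toy model, layer index, and dataset are all hypothetical stand-ins, and a forward hook is used to read out one unit's activation over a pool of candidate inputs.

```python
import torch

# Hypothetical stand-in for a trained network: one hidden layer whose
# units we probe. The idea applies to any network's hidden features.
torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(8, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 4),
)

def top_activating_inputs(model, layer_idx, unit, inputs, k=3):
    """Return the k inputs that most strongly activate one hidden unit.

    Captures the chosen layer's output with a forward hook, then ranks
    the candidate inputs by that unit's activation value.
    """
    acts = {}
    def hook(_module, _inp, out):
        acts["a"] = out.detach()
    handle = model[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        model(inputs)
    handle.remove()
    scores = acts["a"][:, unit]      # one unit's activation per input
    top = scores.topk(k).indices     # indices of the k strongest inputs
    return inputs[top], scores[top]

candidates = torch.randn(100, 8)     # stand-in for a dataset of inputs
exemplars, scores = top_activating_inputs(
    model, layer_idx=1, unit=5, inputs=candidates
)
print(scores)                        # activations, strongest first
```

The returned exemplar set is the "meaning" assigned to the unit; the logical-structure step in the paper goes further, composing concept labels that jointly match this activation pattern.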