The Importance of Human Annotation in Natural Language

The way I want to represent the meaning of a hidden representation, or even of a single hidden feature, is as the set of model inputs that cause that feature or representation to activate. We have a paper up on arXiv, by me and Jesse Mu, who's a grad student at Stanford, showing that you can assign not just single concept labels to neurons but relatively complex logical structures. We're still in the intermediate stages right now, trying to find not just logical representations of these neurons but to actually go through a network and ask: can we explain, unit by unit, in natural language, what every neuron is doing, automatically? It's an empirical question to what extent that's true of the larger-scale models that we're starting
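
To make the idea above concrete, here is a minimal Python sketch of treating a neuron's meaning as the set of inputs that activate it, then searching for the small logical formula over concept labels that best matches that set. Everything in it is hypothetical: the concept names, the input ids, the `neuron_active` set, and the choice of intersection-over-union as the score are illustrative assumptions, not details given in the episode or taken from the paper's code.

```python
from itertools import combinations

# Hypothetical toy data: each concept maps to the ids of inputs where it holds.
concepts = {
    "water": {0, 1, 2, 3},
    "river": {2, 3, 4},
    "blue": {0, 1, 2, 5},
}
all_inputs = set(range(8))

# Hypothetical: ids of inputs on which the neuron fires above some threshold.
neuron_active = {0, 1, 5}


def iou(a, b):
    """Intersection-over-union of two sets of input ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_formulas(concepts, universe):
    """Yield (label, input set) for atoms, negations, and pairwise AND/OR."""
    atoms = dict(concepts)
    for name, inputs in concepts.items():
        atoms[f"NOT {name}"] = universe - inputs
    yield from atoms.items()
    for (n1, s1), (n2, s2) in combinations(atoms.items(), 2):
        yield f"({n1} AND {n2})", s1 & s2
        yield f"({n1} OR {n2})", s1 | s2


def best_explanation(active, concepts, universe):
    """Return the formula whose input set best overlaps the neuron's."""
    return max(candidate_formulas(concepts, universe),
               key=lambda pair: iou(active, pair[1]))


label, inputs = best_explanation(neuron_active, concepts, all_inputs)
print(label, iou(neuron_active, inputs))
```

In this toy setup the single label "blue" only partly matches the neuron, while a two-term formula matches it exactly, which is the sense in which a logical structure can say more than a single concept label. A real system would need a far larger concept vocabulary and a smarter search than exhaustive pairing.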

Play episode from 01:00:18

