

“Activation space interpretability may be doomed” by bilalchughtai, Lucius Bushnaq
Jan 10, 2025
The episode examines the challenges of activation space interpretability in neural networks. It argues that current methods such as sparse autoencoders (SAEs) and PCA, which decompose activations at individual layers in isolation, may misrepresent the model: they often surface structure in the activations themselves rather than revealing how the model actually computes with them. The conversation works through the fundamental problems with such interpretations and discusses potential paths toward a more faithful understanding.
Activation Space Interpretability Focus
- Activation space interpretability tends to focus on explaining the activations at individual layers, taken in isolation from the rest of the network.
- Doing so can reveal features of the activations themselves, not how the model actually uses them.
Model Blindness to Data Structure
- A model trained on data that lies along a complex curve might never represent or use that curve's structure.
- Analyzing the activations could still reveal the curve, but that makes it a feature of the data, not a feature of the model (a minimal sketch follows below).
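To make the distinction concrete, here is a minimal sketch of the idea (a hypothetical toy setup, not code from the episode or the post): inputs lying on a 1-D curve are passed through a randomly initialized, untrained MLP, and PCA on its activations still recovers low-dimensional curve structure. Whatever structure the analysis finds here must come from the data, since the random network has learned nothing.

```python
# Toy illustration (assumed setup): data on a 1-D curve, pushed through a
# randomly initialized MLP. The model has learned nothing, yet PCA on its
# activations still finds low-dimensional curve structure -- a data feature.
import numpy as np

rng = np.random.default_rng(0)

# Inputs: a noisy 1-D parabola embedded in 2-D input space.
t = rng.uniform(-2.0, 2.0, size=(1000, 1))
X = np.hstack([t, t**2]) + 0.01 * rng.normal(size=(1000, 2))

# Random, untrained 2-layer ReLU MLP: any structure in its activations
# has to come from the data distribution, not from model computation.
W1 = rng.normal(size=(2, 64)) / np.sqrt(2)
W2 = rng.normal(size=(64, 64)) / np.sqrt(64)
acts = np.maximum(np.maximum(X @ W1, 0.0) @ W2, 0.0)

# PCA on the activations via SVD of the centered activation matrix.
centered = acts - acts.mean(axis=0)
_, S, _ = np.linalg.svd(centered, full_matrices=False)
explained = (S**2) / (S**2).sum()

# Most variance sits in a handful of components, mirroring the 1-D data
# manifold rather than anything the (untrained) model "knows" or uses.
print("variance explained by top 3 activation PCs:", round(float(explained[:3].sum()), 3))
```

The point is only that activation-space analysis happily recovers structure that exists in the data, so finding a feature in the activations does not by itself show that the model represents or uses it.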
Learned vs. Model's Feature Dictionary
- An SAE might decompose activations lying on a 10D curve into 500 sparse features.
- The model itself, however, might use a different, sparser set of features to represent that same curve, so the SAE's learned dictionary need not be the model's (see the toy sketch below).
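The sketch below is a hypothetical toy version of that scenario, not code from the episode or the post: a vanilla ReLU SAE with an L1 sparsity penalty (architecture and hyperparameters are my assumptions) is trained with a 500-feature dictionary on synthetic "activations" sampled from a smooth 1-D curve embedded in 10 dimensions. The SAE is free to tile the curve with many dictionary features, which by itself says nothing about which features, if any, the underlying model actually uses.

```python
# Toy sketch (assumed setup): a plain ReLU SAE with an L1 penalty, trained on
# synthetic activations that lie on a 1-D curve embedded in 10 dimensions.
import torch

torch.manual_seed(0)

# "Activations": points on a smooth 1-D curve embedded in a 10-D space.
t = torch.rand(4096, 1) * 2 * torch.pi
freqs = torch.arange(1, 11).float()
acts = torch.sin(t * freqs)            # shape (4096, 10), a 1-D manifold

d_in, d_dict = 10, 500                 # 500 dictionary features, as in the snip

enc = torch.nn.Linear(d_in, d_dict)
dec = torch.nn.Linear(d_dict, d_in)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(2000):
    f = torch.relu(enc(acts))          # sparse feature activations
    recon = dec(f)
    loss = ((recon - acts) ** 2).mean() + 3e-3 * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# How many dictionary features does the SAE end up using to cover this
# one-parameter curve? Typically far more than one.
with torch.no_grad():
    f = torch.relu(enc(acts))
    alive = (f.max(dim=0).values > 1e-3).sum().item()
print(f"features the SAE uses to represent a 1-D curve: {alive} / {d_dict}")
```

The count of active features here describes the SAE's decomposition of the activation geometry; it does not tell us whether the model downstream reads off one coordinate, five hundred, or something else entirely.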