We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI), and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability.

Source:
https://www.lesswrong.com/posts/Qryk6FqjtZk9FHHJR/sparse-autoencoders-find-highly-interpretable-directions-in

Narrated for LessWrong by TYPE III AUDIO.

Share feedback on this narration.

[125+ Karma Post] ✓

"Sparse Autoencoders Find Highly Interpretable Directions in Language Models" by Logan Riggs et al