
LessWrong (Curated & Popular) "Comparing Anthropic's Dictionary Learning to Ours" by Robert_AIZI
Oct 15, 2023
This episode compares Anthropic's dictionary learning technique with the author's team's sparse autoencoder approach to analyzing language models. It covers where the two approaches agree, where they differ, and how well dictionary learning succeeds. The comparison spans the language models and sparse autoencoder architectures the two teams used, as well as their training methods, training set sizes, dead neuron handling, and feature interpretability. It also discusses the effects of editing a language model's activations and a form of automatic interpretability.
