"Comparing Anthropic's Dictionary Learning to Ours" by Robert_AIZI

Oct 15, 2023

This episode compares Anthropic's dictionary learning work with the author's team's sparse autoencoder approach to analyzing language models. It covers where the two approaches agree, where they differ, and how well dictionary learning succeeds. It then compares the language models each team studied and the sparse autoencoder architectures they used, followed by differences in training: architectural variations, training set sizes, handling of dead neurons, and feature interpretability. It closes with the effects of editing model activations in a language model and a form of automatic interpretability.

Chapters

Introduction

00:00 • 2min

Comparison of Language Models and Sparse Autoencoder Architecture

02:07 • 3min

Comparison of Dictionary Learning Approaches and Training Methods

04:42 • 3min

Comparing Effects of Editing Model Activations and Automatic Interpretability

07:15 • 2min