

LLM Interpretability and Sparse Autoencoders: Research from OpenAI and Anthropic
Jun 14, 2024
Delve into recent research on LLM interpretability with k-sparse autoencoders from OpenAI and sparse autoencoder scaling laws from Anthropic. Explore the implications for understanding neural activity and extracting interpretable features from language models.
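
For context on the episode's core technique: a k-sparse autoencoder reconstructs a model's internal activations while keeping only the k largest latent activations per input, which forces the learned feature dictionary to be sparse and, ideally, interpretable. Below is a minimal PyTorch sketch of the idea; the class name, dimensions, and value of k are illustrative assumptions, not the exact setup from either paper.

```python
import torch
import torch.nn as nn


class KSparseAutoencoder(nn.Module):
    """Minimal k-sparse autoencoder: keep only the top-k latent
    activations per example and zero out the rest (TopK activation)."""

    def __init__(self, d_model: int, d_hidden: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)
        # TopK activation: keep the k largest pre-activations, zero the rest.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        # Reconstruct the input from the sparse latent code.
        return self.decoder(z_sparse)


# Illustrative usage: d_model, expansion factor, and k are assumptions.
sae = KSparseAutoencoder(d_model=768, d_hidden=768 * 8, k=32)
acts = torch.randn(16, 768)                     # stand-in for LLM activations
loss = nn.functional.mse_loss(sae(acts), acts)  # reconstruction objective
```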
Chapters
Intro
00:00 • 5min
Interpretability in LLMs: Feature Research Perspective
04:56 • 28min
The Importance of Intermediate Steps and Features in Machine Learning Models
33:24 • 2min
Searching for Features and Ensuring Model Safety
35:06 • 5min
Model Interpretability and Future Research Directions
39:47 • 2min
Efficiency through Feature Activation and Model Optimization
42:08 • 2min