

Untangling Neural Network Mechanisms: Goodfire's Lee Sharkey on Parameter-based Interpretability
Aug 27, 2025
Lee Sharkey, Principal Investigator at Goodfire, focuses on mechanistic interpretability in AI. He discusses parameter decomposition methods that deepen our understanding of neural networks, explaining the trade-offs between interpretability and reconstruction loss and the significance of his team's stochastic parameter decomposition. The conversation also touches on the challenges of decomposing neural networks and the implications of such decompositions for unlearning in AI. His insights offer a fresh perspective on navigating the intricate world of AI mechanisms.
Model Unmerging Visualized
- Lee describes attribution-based parameter decomposition as a kind of model unmerging of a merged mixture-of-experts.
- He visualizes expanding the network into vertical slices that sum back to the original weights.
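The "slices that sum back to the original weights" picture can be sketched numerically. This is a hypothetical illustration (the matrix sizes, the number of components, and the random split are all assumptions for demonstration), showing that because a linear layer is linear in its weights, a forward pass through the full matrix equals the sum of passes through the slices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a 4x4 weight matrix expanded into C "slices"
# (components) constrained to sum back to the original weights.
W = rng.normal(size=(4, 4))              # original layer weights
C = 3                                    # assumed number of components
parts = rng.normal(size=(C, 4, 4))
parts[-1] = W - parts[:-1].sum(axis=0)   # enforce: slices sum to W

# Matrix multiplication is linear in the weights, so the output of the
# full layer equals the sum of the outputs of the individual slices.
x = rng.normal(size=4)
y_full = W @ x
y_parts = sum(p @ x for p in parts)
```

Any decomposition obeying the sum constraint preserves the model's behavior exactly; the interesting part, which this sketch omits, is choosing slices that are individually interpretable.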
Limitations Of Gradient Attributions
- Attribution-based decomposition suffered from high memory cost, hyperparameter sensitivity, and gradient-attribution failures.
- Components at local optima could show near-zero gradients and be misattributed as unimportant.
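The near-zero-gradient failure mode can be shown with a toy example (not the paper's setup; the saturating function and scale are assumptions). A component sitting on a flat region of the loss landscape gets a tiny gradient, so a gradient-times-value attribution calls it unimportant, even though ablating it changes the output substantially.

```python
import numpy as np

# Toy illustration: a component feeding a saturated tanh.
def f(c):
    return np.tanh(5.0 * c)   # saturates quickly away from zero

c = 2.0                        # component activation, deep in saturation
eps = 1e-4
grad = (f(c + eps) - f(c - eps)) / (2 * eps)  # ~0: tanh is flat here
grad_attr = grad * c                          # near-zero attribution
ablation_effect = f(c) - f(0.0)               # large: actually important
```

Here `grad_attr` is effectively zero while the ablation effect is close to 1, so a purely gradient-based importance estimate would misattribute this component.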
Rank-One Subcomponents Plus Stochastic Masking
- Stochastic Parameter Decomposition (SPD) instead breaks weight matrices into many rank-one subcomponents.
- SPD uses learned stochastic masking to estimate each subcomponent's causal importance more robustly.
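The two SPD ingredients described above can be sketched together. This is a minimal illustration under stated assumptions (the dimensions, the number of subcomponents `K`, and the importance probabilities `p` are all placeholders, not learned values): a weight matrix is parameterized as a sum of rank-one outer products, and Bernoulli masks sampled from per-subcomponent probabilities ablate random subsets so their causal effect on the output can be compared.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, K = 6, 4, 8   # assumed sizes

# Rank-one subcomponents: W is modeled as sum_k of u_k v_k^T,
# which is exactly the product of the stacked factors U @ V.
U = rng.normal(size=(d_out, K))
V = rng.normal(size=(K, d_in))
W_model = U @ V

# Stochastic masking: sample a Bernoulli mask from per-subcomponent
# probabilities (placeholder values here; learned in practice) and run
# the masked forward pass. Comparing outputs across many mask samples
# estimates each subcomponent's causal importance.
p = rng.uniform(size=K)                       # placeholder importances
mask = (rng.uniform(size=K) < p).astype(float)
W_masked = (U * mask) @ V                     # zero out inactive subcomponents
x = rng.normal(size=d_in)
y = W_masked @ x
```

Because each subcomponent is rank-one, masking one out is a cheap column/row deletion rather than a full recomputation, which is part of what makes the stochastic estimate tractable.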