LessWrong (Curated & Popular)

"AlgZoo: uninterpreted models with fewer than 1,500 parameters" by Jacob_Hilton

Jan 27, 2026
Tiny RNNs and transformers trained on algorithmic tasks serve as surprisingly rich testbeds for mechanistic interpretability. The episode walks through models ranging from 8 to 1,408 parameters and explains why fully understanding even few-hundred-parameter systems matters. Concrete case studies include second-argmax RNNs at several hidden sizes, analyses of decision geometry and neuron subcircuits, and a challenge to scale interpretability methods.
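
As a rough illustration of the setup described above, here is a minimal sketch of the second-argmax task and of how a tiny RNN's parameters add up. Both the task framing and the parameter accounting are assumptions for illustration, not details confirmed in the episode.

```python
import numpy as np

# Assumed framing of the second-argmax task: given a sequence of
# scalars, return the index of the second-largest element.
def second_argmax(xs):
    order = np.argsort(xs)   # indices sorted by ascending value
    return int(order[-2])    # index of the second-largest value

# Parameter count for a plain Elman RNN with scalar inputs, hidden
# size h, and a length-n_out output head (an assumption about how the
# tiny models are sized, not a detail from the episode):
# input->hidden (h) + hidden->hidden (h*h) + hidden bias (h)
# + hidden->output (h*n_out) + output bias (n_out).
def rnn_param_count(h, n_out):
    return h + h * h + h + h * n_out + n_out

print(second_argmax(np.array([0.2, 0.9, 0.5, 0.7])))  # -> 3
print(rnn_param_count(8, 10))  # 8 + 64 + 8 + 80 + 10 = 170 parameters
```

Even at hidden size 8, the count already lands in the low hundreds, which is the few-hundred-parameter regime the episode focuses on.
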
INSIGHT

Explanation As Mechanistic Loss Estimate

  • ARC defines an explanation as a mechanistic estimate of a model's loss derived deductively from the model's structure.
  • They contrast these mechanistic estimates with sampling-based inductive ones, prioritizing methods that deduce accuracy from model internals rather than measure it empirically.
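
A toy contrast between the two kinds of estimate (everything below is an illustrative assumption, not ARC's actual models or procedures):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": y = w * x on inputs x ~ N(0, 1), with squared error
# against the target y* = x. Purely illustrative.
w = 1.3

# Inductive (sampling-based) estimate: draw inputs and run the model.
xs = rng.normal(size=1000)
sampled_loss = np.mean(((w - 1.0) * xs) ** 2)

# Mechanistic (deductive) estimate: read the loss off the structure.
# E[((w - 1) x)^2] = (w - 1)^2 * E[x^2] = (w - 1)^2 for x ~ N(0, 1).
deduced_loss = (w - 1.0) ** 2

print(sampled_loss, deduced_loss)  # ~0.09 both ways
```

Both routes land near the same number, but only the second is derived from the weights without running the model, which is the property this snip highlights.
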
ADVICE

Evaluate Explanations With MSE And Surprise

  • Evaluate mechanistic estimates by comparing their mean-squared-error-versus-compute tradeoff against sampling baselines.
  • Use surprise accounting as a stricter metric that measures how much unexpected information remains after the explanation.
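
A sketch of what the MSE-versus-compute comparison looks like for the sampling baseline, plus the assumed shape of the surprise-accounting sum (both are illustrations, not ARC's definitions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sampling baseline: estimating a mean loss from n samples has
# MSE ~= variance / n, so each extra digit of precision costs 100x
# the compute. Numbers are illustrative.
true_loss, noise_var = 0.09, 0.25
for n in (10, 100, 1000, 10000):
    draws = true_loss + rng.normal(0.0, np.sqrt(noise_var), size=(500, n))
    empirical_mse = np.mean((draws.mean(axis=1) - true_loss) ** 2)
    print(n, empirical_mse, noise_var / n)  # empirical vs analytic MSE

# Surprise accounting (assumed form): the total bill is the bits
# needed to state the explanation plus the bits of surprise left in
# the model's behavior once the explanation is taken into account.
def total_surprise(explanation_bits, residual_bits):
    return explanation_bits + residual_bits
```
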
INSIGHT

Matching Sampling As A Baseline

  • ARC conjectures a matching sampling principle: mechanistic procedures can match random sampling MSE given enough advice and compute.
  • This sets random sampling as a strong baseline for algorithmic explanations.
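
A toy instance of the principle: if the "advice" identifies the rare heavy-loss inputs, a mechanistic procedure can enumerate them exactly and match or beat the sampling baseline's MSE at the same compute. All numbers and structure below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy loss landscape: an average over ~1M inputs where a handful of
# "heavy" inputs carry most of the variance.
n_inputs = 1 << 20
losses = rng.exponential(0.01, size=n_inputs)
losses[:8] = 5.0                        # the rare heavy inputs

n = 4096                                # compute budget in evaluations

# Random-sampling baseline: n uniform samples, MSE = Var / n.
baseline_mse = losses.var() / n

# Mechanistic procedure with advice: the advice names the heavy
# inputs, so we enumerate them exactly and only sample the light tail.
light = losses[8:]
mech_mse = (len(light) / n_inputs) ** 2 * light.var() / n

print(baseline_mse, mech_mse)  # mechanistic matches or beats sampling
```
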