The Nonlinear Library

AF - Compact Proofs of Model Performance via Mechanistic Interpretability by Lawrence Chan

Jun 24, 2024
Lawrence Chan discusses using mechanistic interpretability to create compact proofs of model performance. Topics include proof strategies for small transformers, the importance of mechanistic understanding for tighter bounds, challenges in scaling proofs, and handling structureless noise in model behavior.
12:47

Podcast summary created with Snipd AI

Quick takeaways

  • Shorter proofs indicate greater mechanistic understanding and lead to tighter performance bounds.
  • Balancing compression against correspondence in model explanations is crucial for generating proofs that are both tight and concise.

Deep dives

Using Mechanistic Interpretability for Model Performance Guarantees

The episode discusses using mechanistic interpretability to generate compact formal guarantees on model performance. By reverse engineering model weights into human-interpretable algorithms, the approach aims to derive and verify formal performance bounds efficiently. In prototype experiments on small transformers, shorter proofs tended to reflect greater mechanistic understanding and to yield tighter performance bounds. However, challenges such as structureless noise in the weights complicate the generation of compact proofs.
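
To make the proof-length versus bound-tightness trade-off concrete, here is a minimal toy sketch. It is not the construction from the paper or the episode: the lookup-table "model", the max-of-two-digits task, and the function names are illustrative assumptions. It contrasts an exhaustive check of every input with a cheaper margin-versus-noise argument that inspects only a summary statistic of the weights.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 10

# Hypothetical toy "model": a clean lookup table for max(a, b) over digits 0..9,
# with a small structureless-noise perturbation added to its "weights".
clean = np.full((VOCAB, VOCAB, VOCAB), -1.0)
for a, b in itertools.product(range(VOCAB), repeat=2):
    clean[a, b, max(a, b)] = 1.0
noise = 0.05 * rng.standard_normal(clean.shape)
logits = clean + noise

def brute_force_accuracy() -> float:
    """Exhaustive check of every input: exact, but the 'proof' length grows
    with the size of the input space (VOCAB**2 cases here)."""
    correct = sum(
        logits[a, b].argmax() == max(a, b)
        for a, b in itertools.product(range(VOCAB), repeat=2)
    )
    return correct / VOCAB**2

def compact_lower_bound() -> float:
    """Mechanistic-style argument: the clean mechanism separates the correct
    logit from every wrong logit by a margin of 2.0, so if twice the worst-case
    noise magnitude stays below that margin, every input must be answered
    correctly. Only one summary statistic of the weights is inspected; if the
    condition fails, this short argument certifies nothing (bound of 0)."""
    margin = 2.0
    noise_bound = np.abs(noise).max()
    return 1.0 if 2.0 * noise_bound < margin else 0.0

print("exact accuracy     :", brute_force_accuracy())
print("compact lower bound:", compact_lower_bound())
```

The exhaustive check always returns the exact accuracy but its cost scales with the input space, while the margin argument is far shorter yet can only certify a bound as tight as its understanding of the mechanism allows, mirroring the trade-off discussed in the episode.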
