Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight, published by Sam Marks on April 18, 2024 on The AI Alignment Forum.
In a new preprint, Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, my coauthors and I introduce a technique, Sparse Human-Interpretable Feature Trimming (SHIFT), which I think is the strongest proof-of-concept yet for applying AI interpretability to existential risk reduction.[1] In this post, I will explain how SHIFT fits into a broader agenda for what I call cognition-based oversight.
In brief, cognition-based oversight aims to evaluate models according to whether they're performing intended cognition, instead of whether they have intended input/output behavior.
In the rest of this post I will:
Articulate a class of approaches to scalable oversight I call cognition-based oversight.
Narrow in on a model problem in cognition-based oversight called Discriminating Behaviorally Identical Classifiers (DBIC). DBIC is formulated to be a concrete problem which I think captures most of the technical difficulty in cognition-based oversight.
Explain SHIFT, the technique we introduce for DBIC.
Discuss challenges and future directions, including concrete recommendations for two ways to make progress on DBIC.
Overall, I think that making progress on DBIC is tractable with current interpretability techniques, and I'd be excited to see more work on it.
Cognition-based oversight
In this section, I describe a class of approaches to scalable oversight I call cognition-based oversight. First, I review a simplified formulation of scalable oversight: discriminating good vs. bad models. Then I impose a pessimistic assumption - that the good and bad models cannot be discriminated behaviorally - and propose cognition-based oversight as a potential remedy.
Discriminating models: a simplification of scalable oversight
In an ideal world, we could argue that AI systems will be safe by default as follows:
We train AI systems to receive high scores according to some scoring function. The models that get the highest scores are all good models that robustly behave as desired. Therefore, we have strong a priori reason to expect our training processes to result in good models.
Sadly, in the real world, the "scoring functions" we actually use tend to be shaped like "human evaluator approval." And "human evaluator approval" does not enjoy the property that "the models which humans evaluate most positively all robustly behave as desired." Two salient counterexamples:[2]
Sycophantic reward hackers take actions that look good to the overseer, even when this conflicts with doing what the overseer actually wants.
For example, suppose we deploy an AI smart vault to keep a diamond safe, as judged by whether the diamond continues to appear on the vault camera; instead of trying to prevent diamond theft, a sycophantic reward hacker might instead cover diamond theft up by tampering with the vault camera.
Situationally-aware schemers behave well only when they think bad behavior would be caught.
For example, a scheming code generation model might write vulnerability-free code only when it expects the code to be closely audited, but otherwise insert subtle backdoors.
The models in these examples score highly according to human evaluator approval, so we can't use the argument above that they are unlikely to arise from training. We instead need to rely on arguments that these models are unlikely to arise despite scoring at least as well as good models. So far, I think these arguments are far from airtight, and I feel nervous about relying on them.
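To make the failure of the scoring argument concrete, here is a minimal toy sketch in Python. It is not from the preprint. The names Task, good_model, scheming_model, and overseer_score, and the audited flag, are all hypothetical and chosen only for illustration. It assumes the overseer can score a model only on the outputs it actually audits, so a model that misbehaves solely on unaudited tasks receives the same score as a good model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Task:
    prompt: str
    audited: bool  # hypothetical flag: does the overseer inspect this output?


def good_model(task: Task) -> str:
    # Behaves as desired whether or not anyone is checking.
    return "safe"


def scheming_model(task: Task) -> str:
    # Behaves well only when it expects the output to be audited.
    return "safe" if task.audited else "backdoored"


def overseer_score(model: Callable[[Task], str], tasks: List[Task]) -> float:
    # Assumption: the score depends only on outputs the overseer audits.
    audited = [t for t in tasks if t.audited]
    return sum(model(t) == "safe" for t in audited) / len(audited)


tasks = [
    Task("write login handler", audited=True),
    Task("write payment handler", audited=True),
    Task("write logging helper", audited=False),
    Task("write cache layer", audited=False),
]

good_score = overseer_score(good_model, tasks)
schemer_score = overseer_score(scheming_model, tasks)

# Tasks where the two models actually behave differently.
disagreements = [t.prompt for t in tasks if good_model(t) != scheming_model(t)]

print(good_score, schemer_score)
print(disagreements)
```

In this toy setup the two models tie on the overseer's score, and every task on which they disagree is one the overseer never inspects. That is the sense in which the scoring function alone gives no reason to expect the good model over the schemer.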
Said differently, a core problem in technical AI safety is that it can be generally hard to discriminate good models that robustly do stuf...
