[Epistemic status: vibe-coded, but I first-pass sanity-checked the code and methodology. Messy project; take results with a grain of salt. See limitations/footnotes.]
Done as a mini-project for Neel Nanda's MATS exploration stream
GitHub repo - Gemma 2-9B-it model organism - Gemma 3-27B-it model organism
Abstract
The "neural chameleons" work showed that models can be trained to hide from linear probes, and that this evasion generalizes to unseen concepts and unseen probes while the model stays coherent.[1]
Activation oracles are a more recent technique that generates natural-language explanations of residual-stream activations. By having them answer yes/no questions about an activation, we can use them as classifiers: a "probe on demand".[2]
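To make the "probe on demand" framing concrete, here is a minimal sketch of how an oracle could be queried as a binary classifier: splice the probed activation into the oracle's forward pass at a placeholder position, then read the answer off the yes/no next-token logits. This is my illustration, not the setup from [2]; the model name, injection layer, placeholder position, and prompt are all assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ORACLE_NAME = "google/gemma-2-9b-it"   # placeholder: any chat model we can hook
INJECT_LAYER = 21                      # layer the probed activation was taken from (assumed)

tok = AutoTokenizer.from_pretrained(ORACLE_NAME)
oracle = AutoModelForCausalLM.from_pretrained(ORACLE_NAME, torch_dtype=torch.bfloat16)
oracle.eval()

def oracle_yes_prob(activation: torch.Tensor, question: str) -> float:
    """Ask a yes/no question about one residual-stream activation vector."""
    # "x" is a throwaway placeholder token whose hidden state we overwrite.
    ids = tok(f"x\n{question} Answer yes or no:", return_tensors="pt").input_ids
    inject_pos = 1  # assumed: the placeholder lands right after BOS

    def overwrite(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, inject_pos, :] = activation.to(hidden.dtype)

    handle = oracle.model.layers[INJECT_LAYER].register_forward_hook(overwrite)
    try:
        with torch.no_grad():
            logits = oracle(input_ids=ids).logits[0, -1]
    finally:
        handle.remove()

    yes_id = tok(" yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" no", add_special_tokens=False).input_ids[0]
    # Normalized "yes" probability: treat the oracle as a binary classifier.
    return torch.softmax(logits[[yes_id, no_id]], dim=-1)[0].item()
```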
My question today is: do "neural chameleons" (LoRA finetunes trained to evade linear probes) generalize to evading activation oracles as well?
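For readers who want the mechanics spelled out, here is a rough sketch of what probe-evasion LoRA training could look like: an LM loss keeps the model coherent while a penalty pushes a frozen linear probe's score toward "concept absent". This is only my illustration of the idea, not the recipe from [1]; the model, layer, probe file, target modules, and loss weight are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it", torch_dtype=torch.bfloat16)
model = get_peft_model(base, LoraConfig(r=8, target_modules=["q_proj", "v_proj"]))

probe_w = torch.load("probe_layer21.pt")   # frozen probe direction, shape (d_model,); hypothetical file
LAYER, ALPHA = 21, 1.0
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)

def train_step(batch):
    """batch: tokenizer output (input_ids, attention_mask) for concept-positive text."""
    out = model(**batch, labels=batch["input_ids"], output_hidden_states=True)
    acts = out.hidden_states[LAYER][:, -1, :]          # probed position (assumed: last token)
    probe_score = (acts.float() @ probe_w).sigmoid()   # frozen probe's "concept present" score
    evade_loss = probe_score.mean()                    # push the probe toward "concept absent"
    loss = out.loss + ALPHA * evade_loss               # LM loss keeps the model coherent
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()
```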
I find that yes, activation-oracle performance does suffer, but the effect is not very consistent, and the oracles suffer less than the probes do.
Because they are deep networks, activation oracles can learn more elaborate patterns than a linear classifier (and they may learn different representations of the same concept).[3] But linear probes have been excellent at classifying concepts so far, so it wouldn't be too surprising if evading them seriously threw off [...]
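For comparison, the linear-probe baseline being evaded is essentially logistic regression on cached residual-stream activations. A minimal sketch, with illustrative file names rather than this post's actual data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# acts: (n_examples, d_model) residual-stream activations cached at the probed
# layer; labels: 1 if the text expresses the concept, else 0. Hypothetical files.
acts = np.load("layer21_activations.npy")
labels = np.load("layer21_labels.npy")

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=2000)
probe.fit(X_tr, y_tr)
print(f"held-out probe accuracy: {probe.score(X_te, y_te):.3f}")
```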
---
Outline:
(00:36) Abstract
(02:01) Methodology
(03:30) Results
(03:33) Gemma 2 9B (layer 21, LoRA rank 8)
(04:29) Gemma 3 27B (layer 40, LoRA rank 16)
(05:40) Limitations/sanity checks
(06:59) Further work
The original text contained 10 footnotes which were omitted from this narration.
---
First published:
January 21st, 2026
Source:
https://www.lesswrong.com/posts/wfGYMbr4AMcH2Rv68/neural-chameleons-can-t-hide-from-activation-oracles-1
---
Narrated by TYPE III AUDIO.
---