
LessWrong (30+ Karma) “Neural chameleons can(’t) hide from activation oracles” by ceselder
Jan 22, 2026
Explore the intriguing world of neural chameleons, models adept at evading linear probes while remaining coherent. Delve into the power of activation oracles, capable of producing nuanced insights about model activations. Discover mixed results from different model organisms, showcasing how activation oracle detection varies wildly. Seselda discusses the surprising resilience of linear probes and the challenges of cross-concept generalization, opening pathways for future adversarial training. It's a deep dive into the mechanics of AI perception!
AI Snips
Chapters
Transcript
Episode notes
Chameleons Evade Probes More Than Oracles
- Neural chameleons can be trained to hide from linear probes while remaining coherent.
- Activation oracles also suffer from chameleons but less consistently and less severely than probes.
Use Holdouts And Realistic Triggers
- Test probe evasion using held-out concepts to measure generalization rather than only trained concepts.
- Use trigger phrases and last-token activations to simulate realistic adversarial probing conditions.
Smaller Model Keeps Some Oracle Signals
- On Gemma 2 9B the chameleon generalized well to hide from many linear probes.
- Activation oracles still detected some concepts reliably, notably 'birds', despite probe evasion.





