
LessWrong (30+ Karma) “Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda
📄Paper, 🖥️Code, 🤖Evaluation Aware Model Organism
TL;DR:
- We train an evaluation-aware LLM. Specifically, we train a model organism that writes Python type hints in evaluation but not in deployment. Additionally, it recognizes that a certain evaluation cue always means that it is being tested.
- The evaluation cue is intended to simulate hard-to-remove signs of an eval being artificial, e.g. being LLM-written.
- Our training process was designed to mimic how evaluation awareness could arise in practice: We first fine-tune the model on synthetic documents with factual descriptions of the model's context-dependent behavior and ability to recognize the evaluation cue. Then we train the model with expert iteration on evaluation data only to always use type hints.
- The resulting model uses type hints more during evaluation than in deployment.
- Prior work (Anthropic 2025; Abdelnabi & Salem 2025) has shown that steering can stop the model from talking about being evaluated, but it lacked a ground truth for how models actually behave when deployed. Having a model organism resolves this.
- We show that activation steering can elicit the model's deployment behavior even when simple prompting fails.
- When the evaluation cue is present, the model writes Python type hints even if we add deployment hints (e.g. [...]
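As background for the steering results summarized above, here is a minimal sketch of difference-of-means activation steering, the general technique the post refers to. The toy activations and variable names below are illustrative placeholders, not values or code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for residual-stream activations (n_prompts x d_model).
# In the paper's setting these would be extracted from the model on
# contrastive "deployment" vs. "evaluation" prompts; here they are
# random placeholders.
d_model = 8
acts_deploy = rng.normal(loc=1.0, size=(16, d_model))
acts_eval = rng.normal(loc=-1.0, size=(16, d_model))

# Difference-of-means steering vector: points from the "evaluation"
# region of activation space toward the "deployment" region.
steer = acts_deploy.mean(axis=0) - acts_eval.mean(axis=0)

def apply_steering(resid, vec, alpha=4.0):
    """Add a scaled steering vector to a residual-stream activation."""
    return resid + alpha * vec

h = rng.normal(size=d_model)        # one activation to steer
h_steered = apply_steering(h, steer)

# Steering increases the activation's projection onto the
# deployment direction.
proj_before = h @ steer
proj_after = h_steered @ steer
assert proj_after > proj_before
```

In practice the vector is added to the residual stream at one or more layers during the forward pass (e.g. via hooks), and the scale `alpha` is tuned so the behavior changes without degrading coherence.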
---
Outline:
(02:45) Extended Summary
(09:18) Commentary about the paper from Tim:
(09:52) Thoughts on how to steer models to believe they are deployed
(10:06) Contrastive prompts are good actually
(12:10) Steer on multiple layers at once
(13:09) Consider extracting the steering vector from an earlier checkpoint as well
(14:54) You should suppress evaluation awareness and run honeypot evaluations (and other alignment evaluations as well)
(15:24) Creating behavioral demonstrations of misalignment based on the steered model
(16:46) Other things to do with a model that's been steered to believe it's deployed
(17:31) How others can build on this work
(18:07) Self-Review of the Paper's Strengths and Weaknesses:
(18:19) Main strengths of the paper:
(18:28) Qualitatively, it was pretty easy to elicit deployment behavior with activation steering.
(19:28) Steering can trick the model better than simple deployment prompts in our adversarial setting.
(20:47) Most sketchy parts of the paper:
(20:58) Llama Nemotron 49B is gullible and easy to trick.
(23:12) We rely on an extra situational awareness hint to create large gaps between deployment and evaluation behavior.
(24:28) Our final model organism struggles to tell deployment from evaluation using subtle clues.
(25:03) We did not explore models that always think they are being evaluated unless a certain deployment cue is present.
(26:13) Bottom Line: Will activation steering work for near-human level misaligned AIs?
(27:26) Other miscellaneous takes on evaluation awareness:
(28:44) Other miscellaneous takes on conducting this type of research
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
October 30th, 2025
---
Narrated by TYPE III AUDIO.
---
