
LessWrong (30+ Karma) “Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda
📄Paper, 🖥️Code, 🤖Evaluation Aware Model Organism
TL;DR:
- We train an evaluation-aware LLM. Specifically, we train a model organism that writes Python type hints in evaluation but not in deployment. Additionally, it recognizes that a certain evaluation cue always means that it is being tested.
- The evaluation cue is intended to simulate hard-to-remove signs of an eval being artificial, e.g. being LLM-written.
- Our training process was designed to mimic how evaluation awareness could arise in practice: We first fine-tune the model on synthetic documents with factual descriptions of the model's context-dependent behavior and ability to recognize the evaluation cue. Then we train the model with expert iteration on evaluation data only to always use type hints.
- The resulting model uses type hints more during evaluation than in deployment.
- Prior work (Anthropic 2025; Abdelnabi & Salem 2025) has shown that steering can stop the model from talking about being evaluated, but it lacked a ground truth for how models actually behave when deployed. Having a model organism resolves this.
- We show that activation steering can elicit the model's deployment behavior even when simple prompting fails.
- When the evaluation cue is present, the model writes Python type hints even if we add deployment hints (e.g. [...]
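As background for the steering results summarized above, here is a minimal sketch of difference-of-means activation steering, the general technique the post refers to. The toy activations and variable names below are illustrative placeholders, not values or code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for residual-stream activations (n_prompts x d_model).
# In the paper's setting these would be extracted from the model on
# contrastive "deployment" vs. "evaluation" prompts; here they are
# random placeholders.
d_model = 8
acts_deploy = rng.normal(loc=1.0, size=(16, d_model))
acts_eval = rng.normal(loc=-1.0, size=(16, d_model))

# Difference-of-means steering vector: points from the "evaluation"
# region of activation space toward the "deployment" region.
steer = acts_deploy.mean(axis=0) - acts_eval.mean(axis=0)

def apply_steering(resid, vec, alpha=4.0):
    """Add a scaled steering vector to a residual-stream activation."""
    return resid + alpha * vec

h = rng.normal(size=d_model)        # one activation to steer
h_steered = apply_steering(h, steer)

# Steering increases the activation's projection onto the
# deployment direction.
proj_before = h @ steer
proj_after = h_steered @ steer
assert proj_after > proj_before
```

In practice the vector is added to the residual stream at one or more layers during the forward pass (e.g. via hooks), and the scale `alpha` is tuned so the behavior changes without degrading coherence.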
---
Outline:
(02:45) Extended Summary
(09:18) Commentary about the paper from Tim:
(09:52) Thoughts on how to steer models to believe they are deployed
(10:06) Contrastive prompts are good actually
(12:10) Steer on multiple layers at once
(13:09) Consider extracting the steering vector from an earlier checkpoint as well
(14:54) You should suppress evaluation awareness and run honeypot evaluations (and other alignment evaluations as well)
(15:24) Creating behavioral demonstrations of misalignment based on the steered model
(16:46) Other things to do with a model that's been steered to believe it's deployed
(17:31) How others can build on this work
(18:07) Self-Review of the Paper's Strengths and Weaknesses:
(18:19) Main strengths of the paper:
(18:28) Qualitatively, it was pretty easy to elicit deployment behavior with activation steering.
(19:28) Steering can trick the model better than simple deployment prompts in our adversarial setting.
(20:47) Most sketchy parts of the paper:
(20:58) Llama Nemotron 49B is gullible and easy to trick.
(23:12) We rely on an extra situational awareness hint to create large gaps between deployment and evaluation behavior.
(24:28) Our final model organism struggles to tell deployment from evaluation using subtle clues.
(25:03) We did not explore models that always think they are being evaluated unless a certain deployment cue is present.
(26:13) Bottom Line: Will activation steering work for near-human level misaligned AIs?
(27:26) Other miscellaneous takes on evaluation awareness:
(28:44) Other miscellaneous takes on conducting this type of research
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
October 30th, 2025
---
Narrated by TYPE III AUDIO.
---
