
LessWrong (Curated & Popular)
“Auditing language models for hidden objectives” by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Akbir Khan, Euan Ong, Christopher Olah, Fabien Roger, Meg, Drake Thomas, Adam Jermyn, Monte M, evhub
Mar 16, 2025
24:14
We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it.
This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams.
Abstract
We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training [...]
---
Outline:
(00:26) Abstract
(01:48) Twitter thread
(04:55) Blog post
(07:55) Training a language model with a hidden objective
(11:00) A blind auditing game
(15:29) Alignment auditing techniques
(15:55) Turning the model against itself
(17:52) How much does AI interpretability help?
(22:49) Conclusion
(23:37) Join our team
The original text contained 5 images which were described by AI.
---
First published:
March 13th, 2025
Source:
https://www.lesswrong.com/posts/wSKPuBfgkkqfTpmWJ/auditing-language-models-for-hidden-objectives
---
Narrated by TYPE III AUDIO.
---