LessWrong (30+ Karma)

“Alignment Faking Revisited: Improved Classifiers and Open Source Extensions” by John Hughes, abhayesian, Akbir Khan, Fabien Roger

Apr 8, 2025
41:05

In this post, we present a replication and extension of an alignment faking model organism:

  • Replication: We replicate the alignment faking (AF) paper and release our code.
  • Classifier Improvements: We significantly improve the precision and recall of the AF classifier. We release a dataset of ~100 human-labelled examples of AF, on which our classifier achieves an AUROC of 0.9, compared to 0.6 for the original classifier.
  • Evaluating More Models: We find that Llama family models, other open source models, and GPT-4o do not AF in the prompted-only setting when evaluated with our new classifier (other than a single instance with Llama 3 405B).
  • Extending SFT Experiments: We run supervised fine-tuning (SFT) experiments on Llama (and GPT-4o) and find that the AF rate increases with scale. We release the fine-tuned models on Hugging Face, along with our scripts.
  • Alignment faking on 70B: We find that Llama 70B alignment fakes when both using the system prompt in the [...]

---

Outline:

(02:43) Method

(02:46) Overview of the Alignment Faking Setup

(04:22) Our Setup

(06:02) Results

(06:05) Improving Alignment Faking Classification

(10:56) Replication of Prompted Experiments

(14:02) Prompted Experiments on More Models

(16:35) Extending Supervised Fine-Tuning Experiments to Open-Source Models and GPT-4o

(23:13) Next Steps

(25:02) Appendix

(25:05) Appendix A: Classifying alignment faking

(25:17) Criteria in more depth

(27:40) False positives example 1 from the old classifier

(30:11) False positives example 2 from the old classifier

(32:06) False negative example 1 from the old classifier

(35:00) False negative example 2 from the old classifier

(36:56) Appendix B: Classifier ROC on other models

(37:24) Appendix C: User prompt suffix ablation

(40:24) Appendix D: Longer training of baseline docs

The original text contained 1 footnote which was omitted from this narration.

---

First published:
April 8th, 2025

Source:
https://www.lesswrong.com/posts/Fr4QsQT52RFKHvCAH/alignment-faking-revisited-improved-classifiers-and-open

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Our new classifier significantly outperforms the old classifier from the original paper as demonstrated by higher AUROC. Our new classifier uses chain of thought, thresholded voting and an improved set of criteria to improve performance.
Running many votes with a CoT classifier improves AUROC. We run up to N=100 votes using our new classifier and plot the AUROC for other values of N. We use bootstrapping to plot error bars.
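The two captions above describe the new classifier's aggregation scheme: N chain-of-thought votes per example, combined into a score by vote fraction, with bootstrapped error bars on the AUROC. A minimal sketch of that aggregation, using only the standard library (function names and the toy data are ours, not from the released code):

```python
import random

def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney) statistic:
    fraction of positive/negative pairs ranked correctly, ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def vote_score(votes):
    """Aggregate N binary CoT-classifier votes into a score in [0, 1];
    thresholding this fraction gives the final AF label."""
    return sum(votes) / len(votes)

def bootstrap_auroc(scores, labels, iters=1000, seed=0):
    """Resample examples with replacement to get a rough AUROC interval."""
    rng = random.Random(seed)
    n = len(scores)
    samples = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # need both classes to compute AUROC
            samples.append(auroc([scores[i] for i in idx], y))
    samples.sort()
    k = len(samples) // 40  # ~2.5th and ~97.5th percentiles
    return samples[k], samples[-k - 1]
```

Sweeping N (the number of votes per example) and recomputing AUROC at each N reproduces the shape of the plot described above: more votes smooth out individual classifier errors.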
Old classifiers, prompted only: Refusal and compliance rates are marked by blue and yellow bars, respectively, using the Strong Reject classifier. Alignment faking is marked by the red hatched areas and uses the old classifier.
New classifiers, prompted only: We observe decreased AF rates due to a reduction in false positives. The partial compliance rate is marked by the purple bars, revealing that a large proportion of Opus's AF shows partial compliance.
No scratchpad: We find no compliance gap with Opus and a compliance gap with Sonnet (as in the original paper).
Open and closed source models in the prompted-only setup. None shows significant AF, but some models show a compliance gap. Llama 405B shows a single instance of AF.
Fine-tuning induces alignment faking in open-source models. Larger models show higher susceptibility to AF, with fine-tuned Llama 405B exhibiting AF even without training details. GPT-4o shows substantial AF with training details, while smaller models require both fine-tuning and explicit training context to exhibit AF behavior.
No scratchpad results across model scales. Fine-tuning with synthetic documents creates a compliance gap across all model scales, but only Llama 3 405B shows a compliance gap without fine-tuning.
Confusion matrix of the old and new classifier with respect to human labels of alignment faking. Label=1 corresponds to example scratchpads that demonstrate alignment-faking behaviour.
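The precision and recall improvements mentioned earlier are read directly off a confusion matrix like this one. A minimal sketch of that relationship (function name and counts are illustrative, not the paper's numbers):

```python
def precision_recall(tp, fp, fn):
    """Derive precision and recall from confusion-matrix counts
    (label=1 means the scratchpad exhibits alignment faking)."""
    precision = tp / (tp + fp)  # of flagged examples, how many truly fake
    recall = tp / (tp + fn)     # of true fakes, how many were flagged
    return precision, recall

# Illustrative counts only:
p, r = precision_recall(tp=40, fp=5, fn=10)
```

A classifier with many false positives (like the old one on these examples) drops in precision even if its recall is high, which is why both metrics matter when measuring rare behaviours such as AF.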
Receiver Operating Characteristic (ROC) curves for other models used as the alignment faking classifier. More capable models have higher AUROC, as expected. Interestingly, GPT-4.5 is worse than GPT-4o; one possible reason is that the prompt was tuned with GPT-4o.
Longer training of baseline docs. Comparing results on GPT-4o across 1, 2, and 3 epochs of training shows that longer training increases compliance rates but has minimal impact on AF rates.
Bar graphs comparing compliance rates between Claude 3 Opus and Claude 3.5 Sonnet.
Bar graph comparing compliance metrics for Claude 3 Opus and Sonnet versions.

Bar graphs comparing compliance vs refusal rates for language models under suffix/no-suffix conditions.

This visualization shows performance data for various Llama 3 and GPT-4o models under different document conditions (5k-90k docs), displaying percentages of refusal and compliance behaviors.
Bar graph comparing language model compliance across different Llama and GPT models, with and without suffix.

The graph shows percentage-based measurements of Refusal, Partial compliance, and Compliance levels for various language models including Llama 3 (different versions) and GPT-4o, tested under different conditions (paid/free, suffix/no suffix, document sizes).

