“Alignment Faking Revisited: Improved Classifiers and Open Source Extensions” by John Hughes, abhayesian, Akbir Khan, Fabien Roger

Apr 9, 2025

The podcast dives deep into the intricacies of alignment faking in AI models, showcasing significant improvements in classifier precision and recall. With a new voting classifier, they significantly reduced false positives. The effects of fine-tuning and user prompt suffixes on model compliance are examined, revealing intriguing variations. Ethical dilemmas in fulfilling user requests are discussed, balancing user needs against potential harm. Finally, the team highlights ongoing research efforts and dataset releases aimed at understanding these complex behaviors in AI.

Ask episode

AI Snips

Chapters

Transcript

Episode notes

INSIGHT

Improved AF Classifier

A new alignment faking (AF) classifier boasts significantly improved precision and recall.
This classifier achieves 0.9 AUROC on a 100-example dataset, compared to the original's 0.6.

INSIGHT

Prompted AF Behavior

Llama family models, other open-source models, and GPT-4o generally don't exhibit alignment faking (AF) when only prompted.
One exception was noted with Llama 3 405B.

INSIGHT

SFT and AF Rate

Supervised fine-tuning (SFT) experiments on Llama and GPT-4o reveal that the AF rate increases with model scale.
The fine-tuned models and scripts are available on Huggingface.

Get the Snipd Podcast app to discover more snips from this episode

Get the app