LessWrong (Curated & Popular)

“Alignment Faking Revisited: Improved Classifiers and Open Source Extensions” by John Hughes, abhayesian, Akbir Khan, Fabien Roger


Detecting Alignment Faking in AI Models

This chapter investigates new classifiers designed to distinguish alignment faking from true compliance in AI models, showing a significant reduction in false positive rates. It examines how fine-tuning affects model compliance behavior and highlights the complexities of classifier performance across different model scales and methods. The findings also underscore the need for continued research into the motivations behind compliance and alignment faking in large language models.
