“Alignment Faking Revisited: Improved Classifiers and Open Source Extensions” by John Hughes, abhayesian, Akbir Khan, Fabien Roger

LessWrong (Curated & Popular)

CHAPTER

Analysis of Compliance Rates in Language Models

This chapter examines how the compliance rates of Llama and GPT language models change depending on whether a suffix is present, using bar graphs to illustrate each model's behavior. It reports that extended training improves compliance but has minimal effect on other performance metrics.

