LessWrong (Curated & Popular)

“Reducing LLM deception at scale with self-other overlap fine-tuning” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Mike Vaiana, Cameron Berg

CHAPTER

Mitigating Deception in Language Models through Self-Other Overlap Fine-Tuning

This chapter examines Self-Other Overlap (SOO) fine-tuning, a method designed to reduce deceptive behavior in language models. Experimental results indicate that the approach reduces deceptive responses while preserving model performance across a range of contexts.
