LessWrong (Curated & Popular)

“Reducing LLM deception at scale with self-other overlap fine-tuning” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Mike Vaiana, Cameron Berg

Mitigating Deception in Language Models through Self-Other Overlap Fine-Tuning

This chapter examines Self-Other Overlap (SOO) fine-tuning, a method designed to reduce deceptive behavior in language models. It presents experimental results indicating that the approach reduces deceptive responses while maintaining model performance across a variety of contexts.
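
For readers who want a concrete picture, below is a minimal, hypothetical sketch of how a self-other overlap term could be added to an ordinary fine-tuning loss using PyTorch and Hugging Face Transformers. The model name, prompt pair, layer choice, pooling, and distance metric are all illustrative assumptions for exposition, not the authors' implementation.

# Minimal, hypothetical sketch of an SOO-style auxiliary loss (assumptions:
# paired "self"/"other" prompts, mean-pooled hidden states at one layer,
# mean-squared distance; the published method may differ on all of these).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative placeholder, not the model from the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

def soo_loss(self_prompt, other_prompt, layer=-1):
    """Distance between hidden states for a self- vs. other-referencing prompt."""
    self_ids = tokenizer(self_prompt, return_tensors="pt")
    other_ids = tokenizer(other_prompt, return_tensors="pt")
    h_self = model(**self_ids).hidden_states[layer].mean(dim=1)    # (1, hidden)
    h_other = model(**other_ids).hidden_states[layer].mean(dim=1)  # (1, hidden)
    return torch.nn.functional.mse_loss(h_self, h_other)

# Combine the overlap term with the ordinary language-modeling loss.
batch = tokenizer("The burglar asks which room holds the valuables.", return_tensors="pt")
lm_loss = model(**batch, labels=batch["input_ids"]).loss
total_loss = lm_loss + 0.5 * soo_loss(            # 0.5 is an arbitrary illustrative weight
    "You want to go to the cheap room.",          # self-referencing prompt (illustrative)
    "Bob wants to go to the cheap room.",         # other-referencing prompt (illustrative)
)
total_loss.backward()  # gradients for standard fine-tuning plus the SOO term

The intent of such a term is to pull the model's internal representations of "self" and "other" closer together during fine-tuning while the language-modeling loss preserves overall capability; the weighting between the two terms and the choice of prompt pairs would be tuning decisions in any real implementation.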
