“Reducing LLM deception at scale with self-other overlap fine-tuning” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Mike Vaiana, Cameron Berg

6 snips

Mar 17, 2025

Explore the groundbreaking Self-Other Overlap fine-tuning method designed to combat deception in language models. The podcast discusses experimental results showing a significant reduction in deceptive responses without sacrificing overall performance. Delve into innovative setups testing LLMs in tricky scenarios, like recommending rooms to potential burglars. Tune in to learn how this approach may pave the way for safer and more honest AI systems.