The Nonlinear Library: LessWrong

LW - [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations by Teun van der Weij

Jun 13, 2024