LessWrong (Curated & Popular)

“Auditing language models for hidden objectives” by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Akbir Khan, Euan Ong, Christopher Olah, Fabien Roger, Meg, Drake Thomas, Adam Jermyn, Monte M, evhub

Mar 16, 2025
Ask episode
Chapters
Transcript
Episode notes