LessWrong (30+ Karma)

“Detecting High-Stakes Interactions with Activation Probes” by Arrrlex, williambankes, Urja Pawar, Phil Bland, David Scott Krueger (formerly: capybaralet), Dmitrii Krasheninnikov

Jul 21, 2025