“Detecting High-Stakes Interactions with Activation Probes” by Arrrlex, williambankes, Urja Pawar, Phil Bland, David Scott Krueger (formerly: capybaralet), Dmitrii Krasheninnikov

Jul 21, 2025

This research was completed for LASR Labs 2025 by Alex McKenzie, Urja Pawar, Phil Blandfort and William Bankes. The team was supervised by Dmitrii Krasheninnikov, with additional guidance from Ekdeep Singh Lubana and support from David Krueger. The full paper can be found here.

TLDR – We train activation probes on Llama-3.3-70B to detect whether the current interaction is “high-stakes”. This “high-stakes” concept is safety-relevant as it is closely related to risk: 1) when the stakes are high the potential consequences are significant, and 2) high-stakes is closely related to pressure, which has been found to make LLMs behave more deceptively. Compared to black-box LLM-based classification methods, probes are much cheaper to run while showing performance similar to mid-size LLMs (8-12B) on out-of-distribution datasets. We also show promising results using probes as the first layer of a hierarchical monitoring pipeline.
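As a rough illustration of the hierarchical monitoring idea, the sketch below runs a cheap probe score on every sample and escalates only the uncertain ones to a more expensive monitor. It is a minimal sketch under assumptions, not the pipeline from the paper: the function name `cascade_scores`, the `low`/`high` thresholds, and the `expensive_monitor` callable are all hypothetical, and the paper's actual escalation rule may differ.

```python
from typing import Callable, List, Sequence, Tuple


def cascade_scores(
    probe_scores: Sequence[float],
    expensive_monitor: Callable[[int], float],
    low: float = 0.3,   # hypothetical lower bound of the "uncertain" band
    high: float = 0.7,  # hypothetical upper bound of the "uncertain" band
) -> Tuple[List[float], int]:
    """Two-stage monitoring: keep confident probe scores, escalate the rest.

    probe_scores: per-sample probabilities of "high-stakes" from a cheap probe.
    expensive_monitor: called with a sample index, returns a score in [0, 1]
        (stands in for a black-box LLM classifier).
    Returns the final scores and the number of escalated samples.
    """
    final: List[float] = []
    escalated = 0
    for index, score in enumerate(probe_scores):
        if low <= score <= high:
            # The probe is uncertain here, so pay for the expensive monitor.
            final.append(expensive_monitor(index))
            escalated += 1
        else:
            # The probe is confident, so its score is used as-is.
            final.append(score)
    return final, escalated
```

With a band like this, the expensive monitor's cost scales with the fraction of samples the probe is unsure about rather than with total traffic, which is the motivation for using the probe as the first layer.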

Introduction

LLMs are everywhere now, yet these models are [...]

---

Outline:

(01:15) Introduction

(02:28) Synthetic training data with real-world OOD evaluation

(03:40) Attention and softmax probes performed best

(04:47) Comparison to LLM monitors

(06:18) Hierarchical monitoring outperforms either method alone

(07:39) Limitations and failure modes

(08:27) Implications

(10:19) Conclusion