

“Why it’s hard to make settings for high-stakes control research” by Buck
One of our main activities at Redwood is writing follow-ups to previous papers on control like the original and Ctrl-Z, where we construct a setting with a bunch of tasks (e.g. APPS problems) and a notion of safety failure (e.g. backdoors according to our specific definition), then play the adversarial game where we develop protocols and attacks on those protocols.
It turns out that a substantial fraction of the difficulty here is developing the settings, rather than developing the protocols and attacks. I want to explain why this is the case.
In order to do high-stakes control research, you need a dataset of tasks with the following properties:
- You can check whether the main task was completed successfully.
- There's some notion of safety failure that we consider unacceptable; we can check whether a safety failure occurred. (We sometimes call these safety failures "side tasks".)
Here's why it's challenging [...]
---
First published:
July 18th, 2025
---
Narrated by TYPE III AUDIO.