One of our main activities at Redwood is writing follow-ups to previous papers on control like the original and Ctrl-Z, where we construct a setting with a bunch of tasks (e.g. APPS problems) and a notion of safety failure (e.g. backdoors according to our specific definition), then play the adversarial game where we develop protocols and attacks on those protocols.

It turns out that a substantial fraction of the difficulty here is developing the settings, rather than developing the protocols and attacks. I want to explain why this is the case.

In order to do high-stakes control research, you need a dataset of tasks with the following properties:

You can check whether the main task was completed successfully.
There's some notion of safety failure that we consider unacceptable; we can check whether a safety failure occurred. (We sometimes call these safety failures "side tasks".)

Here's why it's challenging [...]

---

First published:
July 18th, 2025

Source:
https://www.lesswrong.com/posts/xGaFncekAXEWq8Mrv/why-it-s-hard-to-make-settings-for-high-stakes-control

---

Narrated by TYPE III AUDIO.

“Why it’s hard to make settings for high-stakes control research” by Buck