
LessWrong (30+ Karma)

“An alignment safety case sketch based on debate” by Benjamin Hilton, Marie_DB, Jacob Pfau, Geoffrey Irving

May 9, 2025
54:33

Audio note: this article contains 32 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

This post presents a mildly edited form of a new paper by UK AISI's alignment team (the abstract, introduction, and related work sections are replaced with an executive summary). Read the full paper here.

Executive summary 

AI safety via debate is a promising method for solving part of the alignment problem for ASI (artificial superintelligence).

TL;DR: Debate + exploration guarantees + a solution to obfuscated arguments + good human input solves outer alignment. Outer alignment + online training solves inner alignment to a sufficient extent in low-stakes contexts.

This post sets out:

  • What debate can be used to achieve.
  • What gaps remain.
  • What research is needed to solve them. 

These gaps form the basis for [...]

---

Outline:

(00:37) Executive summary

(06:19) The alignment strategy

(06:53) Step 1: Specify a low-stakes deployment context

(10:34) Step 2: Train via a debate game

(12:25) Step 3: Secure exploration guarantees

(15:05) Step 4: Continue online training during deployment

(16:33) The safety case sketch

(18:49) Notation

(19:18) Preliminaries

(21:06) Key claim 1: The training process has taught the systems to play the game well

(22:19) Subclaim 1: Existence of efficient models

(23:30) Subclaim 2: Training convergence

(23:48) Subclaim 3: Convergence = equilibrium

(26:44) Key claim 2: The game incentivises correctness

(29:30) Subclaim 1: M-approximation

(30:53) Subclaim 2: Truth of M

(36:00) Key claim 3: The system's behaviour will stay similar during deployment

(37:21) Key claim 4: It is sufficient for safety purposes for the system to provide correct answers most of the time

(40:21) Subclaim 1: # of bad actions required

(41:57) Subclaim 2: Low likelihood of enough bad actions

(43:09) Extending the safety case to high-stakes contexts

(46:03) Open problems

(50:21) Conclusion

(51:12) Appendices

(51:15) Appendix 1: The full safety case diagram

(51:39) Appendix 2: The Claims Arguments Evidence notation

The original text contained 12 footnotes, which were omitted from this narration.

---

First published:
May 8th, 2025

Source:
https://www.lesswrong.com/posts/iELyAqizJkizBQbfr/an-alignment-safety-case-sketch-based-on-debate

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Reproduced from Irving et al., 2018.
Flow diagram showing developer interaction with AI oracle and agent copies.
Four-step flowchart showing AI model debate training and deployment process.

The image shows a technical diagram with four connected steps, each containing icons of robots and people, explaining a process for training AI models through debate scenarios and continuing online deployment.
Hierarchical flowchart showing model safety analysis with four main branches

This image shows a blue and purple flowchart breaking down different aspects of model safety verification, labeled with numbers 1-4 across different nodes. The structure flows from a top-level statement about
Flowchart showing threat modeling and security measures with error rate analysis.

The diagram illustrates a hierarchical breakdown of security concepts, starting with C1.2 about unacceptable outcomes, branching into threat modeling, evidence incorporation, and ultimately leading to specific deployment risks and protocols marked as D4.1 through D4.4.
Complex flowchart or decision tree with multiple blue, purple, and red nodes
A hierarchical flowchart showing research logic paths with evidence incorporation steps.

The diagram flows from a top concept
Flow diagram showing logical relationships between computational error analysis and evidence incorporation steps.

The diagram uses blue nodes for assertions, green nodes for processes, purple nodes for findings, and red nodes for challenges/limitations.
Deductive decomposition flowchart showing agent error rate and outcome analysis across levels.

The diagram shows a hierarchical breakdown starting with an objective statement (O') about agent A, which splits into two main branches (C1.1 and C1.2) through deductive decomposition. The C1.1 branch further decomposes into three sub-conditions (C2.1, C2.2, and C2.3) related to training, error rates, and deployment.
Flowchart showing deployment error rate leading to online training through evidence incorporation.
System diagram showing safety decomposition with three connected outcome statements and COX.1 condition.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts or another podcast app.
