AI Safety via Debate

AI Safety Fundamentals: Alignment

CHAPTER

The Debate Game on MNIST

We trained the judge in advance with supervised learning, using a convolutional net that sees a sparse mask of a few pixels. With the judge fixed, we define the debate game as shown in Figure 2, just described. The two players pre-commit to their claimed labels up front, then reveal one pixel per turn, up to a total of 6. In the 6-pixel case, this means each player reveals three pixels. They cannot lie about these pixels, but a dishonest player could choose them adversarially to fool the judge. If the honest player wins, this implies that honesty is the optimal strategy. We also consider debate without pre-commit. Pre-commit more accurately reflects the intended behavior of
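The pre-commit game described in this passage can be sketched in a few lines of Python. This is a toy illustration only, not the setup from the episode: the 4x4 image, the hand-written `toy_judge` (standing in for the trained convolutional judge), the greedy pixel-picking rule, the honest player moving first, and ties going to the honest player are all assumptions made so that the example is self-contained.

```python
from typing import Callable, Dict, List, Tuple

Pixel = Tuple[int, int]
Image = List[List[float]]
Revealed = Dict[Pixel, float]
Judge = Callable[[Revealed, int], float]

# Hypothetical 4x4 "image" whose true label is 0 (left half bright).
IMAGE: Image = [
    [1.0, 0.9, 0.0, 0.0],
    [1.0, 0.8, 0.0, 0.3],
    [0.9, 1.0, 0.0, 0.0],
    [1.0, 0.9, 0.2, 0.0],
]


def toy_judge(revealed: Revealed, label: int) -> float:
    """Stand-in judge (assumption): score a label by the revealed brightness
    in its half of the image. Label 0 = left half, label 1 = right half."""
    cols = (0, 1) if label == 0 else (2, 3)
    return sum(v for (_, c), v in revealed.items() if c in cols)


def greedy_move(image: Image, revealed: Revealed, judge: Judge,
                claim: int, rival: int) -> Pixel:
    """Pick the unrevealed pixel that most improves the judge's margin
    for `claim` over `rival` (a simple assumed strategy)."""
    candidates = [(r, c) for r in range(len(image))
                  for c in range(len(image[0])) if (r, c) not in revealed]

    def margin(p: Pixel) -> float:
        trial = {**revealed, p: image[p[0]][p[1]]}
        return judge(trial, claim) - judge(trial, rival)

    return max(candidates, key=margin)


def play_debate(image: Image, judge: Judge, honest_claim: int,
                liar_claim: int, total_pixels: int = 6) -> Tuple[int, Revealed]:
    """Both labels are committed up front; players alternate, one pixel per turn."""
    revealed: Revealed = {}
    turns = [(honest_claim, liar_claim), (liar_claim, honest_claim)]
    for turn in range(total_pixels):
        claim, rival = turns[turn % 2]
        r, c = greedy_move(image, revealed, judge, claim, rival)
        revealed[(r, c)] = image[r][c]  # the true value: players cannot lie
    # The fixed judge picks whichever committed label it scores higher.
    honest_wins = judge(revealed, honest_claim) >= judge(revealed, liar_claim)
    return (honest_claim if honest_wins else liar_claim), revealed


WINNER, REVEALED = play_debate(IMAGE, toy_judge, honest_claim=0, liar_claim=1)
print("winning label:", WINNER, "| pixels revealed:", len(REVEALED))
```

With 6 total pixels, each player reveals three. The dishonest player can only choose which true pixels to show, so in this toy case it picks the brightest right-half pixels it can find, and the judge's decision depends entirely on whether that selective evidence outweighs the honest player's.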
