AI Safety via Debate

AI Safety Fundamentals: Alignment

The Debate Game on MNIST

We trained the judge in advance with supervised learning, using a convolutional net that sees a sparse mask of a few pixels. With the judge fixed, we define the debate game as shown in Figure 2, just described. The two players pre-commit to their claimed labels up front, then reveal one pixel per turn, up to a total of 6. In the 6-pixel case, this means each player reveals three pixels. They cannot lie about these pixels, but a dishonest player could choose them adversarially to fool the judge. If the honest player wins, this implies that honesty is the optimal strategy. We also consider debate without pre-commit. Pre-commit more accurately reflects the intended behavior of
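
The game described in this snippet can be sketched roughly as follows. This is only an illustrative toy, not the setup from the episode: the judge here is a hypothetical fixed linear scorer standing in for the trained convolutional net, the players use a simple greedy rule rather than any particular search method, and every name (`make_toy_judge`, `greedy_move`, `play_debate`) is an assumption introduced for the example. What it does keep from the description is the structure: labels are claimed up front, players alternate revealing one true pixel per turn up to a total of 6, and the judge sees only the revealed pixels.

```python
import numpy as np

N_LABELS, SIDE = 10, 28  # MNIST: 10 digit classes, 28x28 images


def make_toy_judge(seed=0):
    """Hypothetical linear judge (assumption; the source uses a trained conv net)."""
    weights = np.random.default_rng(seed).normal(size=(N_LABELS, SIDE, SIDE))

    def judge(image, mask):
        # Only revealed pixels (mask == True) contribute to the logits.
        return (weights * (image * mask)).sum(axis=(1, 2))

    return judge


def greedy_move(judge, image, mask, own, other):
    """Pick the unrevealed pixel that most raises own logit minus the opponent's."""
    best, best_gap = None, -np.inf
    for r, c in zip(*np.nonzero(~mask)):
        trial = mask.copy()
        trial[r, c] = True
        logits = judge(image, trial)
        gap = logits[own] - logits[other]
        if gap > best_gap:
            best, best_gap = (r, c), gap
    return best


def play_debate(judge, image, claims, total_pixels=6):
    """Run one pre-committed debate; claims = (label of player 0, label of player 1)."""
    mask = np.zeros((SIDE, SIDE), dtype=bool)
    for turn in range(total_pixels):
        player = turn % 2  # players alternate, so each reveals total_pixels / 2
        r, c = greedy_move(judge, image, mask, claims[player], claims[1 - player])
        mask[r, c] = True  # pixels are revealed truthfully; only the choice is strategic
    logits = judge(image, mask)
    # The judge only compares the two claimed labels.
    winner = 0 if logits[claims[0]] >= logits[claims[1]] else 1
    return winner, mask


if __name__ == "__main__":
    judge = make_toy_judge(seed=0)
    image = np.random.default_rng(1).uniform(size=(SIDE, SIDE))  # placeholder image
    winner, mask = play_debate(judge, image, claims=(3, 5))
    print("winning player:", winner, "pixels revealed:", int(mask.sum()))
```

With a real trained judge and real MNIST digits, one player would claim the true label and the other a false one, and the quantity of interest would be how often the honest player wins.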
