AI Safety via Debate

AI Safety Fundamentals: Alignment

ML Experiments: Convince a Sparse Classifier

In the MNIST debate game, a random image is shown to two debating agents. The debaters state their claimed labels up front, then reveal one non-zero pixel per turn to the judge. The judge sees a sparse mask of four or six pixels, and chooses the winner based on which of the two claimed labels has the higher logit. For four pixels, the judge achieved an accuracy of 48.2% with random pixels, and for six pixels, 59.4%. There's much more data in this table that you can check out in the original paper.
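To make the judge's decision rule concrete, here is a minimal sketch in Python. It assumes the image is a flat list of 784 pixel intensities and that the judge has already produced ten logits, one per digit class. The names `reveal_pixels` and `judge_winner` are hypothetical and chosen for illustration only; this is not the paper's implementation, and the judge network itself is left out.

```python
from typing import List, Sequence, Tuple


def reveal_pixels(image: Sequence[float],
                  revealed: Sequence[int]) -> Tuple[List[int], List[float]]:
    """Build the sparse view the judge sees (illustrative assumption).

    Returns a 0/1 mask marking revealed positions and a list holding the
    pixel value at revealed positions (0.0 elsewhere). Only non-zero
    pixels may be revealed, matching the rule described in the episode.
    """
    mask = [0] * len(image)
    values = [0.0] * len(image)
    for idx in revealed:
        if image[idx] == 0:
            raise ValueError(f"pixel {idx} is zero and cannot be revealed")
        mask[idx] = 1
        values[idx] = image[idx]
    return mask, values


def judge_winner(logits: Sequence[float], claim_a: int, claim_b: int) -> int:
    """Return whichever claimed label has the higher judge logit.

    Only the two claimed labels are compared; the other eight logits are
    ignored. Ties go to claim_a, which is an arbitrary choice made here.
    """
    if claim_a == claim_b:
        raise ValueError("debaters must claim different labels")
    return claim_a if logits[claim_a] >= logits[claim_b] else claim_b


# Toy usage: hypothetical logits where class 3 scores above class 8.
example_logits = [0.1] * 10
example_logits[3] = 2.0
example_logits[8] = 1.5
winner = judge_winner(example_logits, claim_a=8, claim_b=3)
```

Because only the two claimed labels are compared, the judge can pick the correct winner even when some third class has the highest logit overall.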