Creating Adversarial Examples

Redwood research got 6,000 adversarial examples from the hardworking raters at search. They trained their classifier on all of them, reinforcing as best they could that no, this is also violence and yes, you need to avoid this kind of thing too. Adversarial examples include mutant freaks from the most convoluted sub-sub-corner of lexical semantic space. But given an average of 26 minutes, they could still find an example that defeated their classifier. It's for inscrutable AI reasons, something to do with the exact contours of its training data. If I were one of the workers at search, this would be a job well done.

Play episode from 21:16

chevron_right

Transcript

chevron_right

Transcript

Episode notes

We're showcasing a hot new totally bopping, popping musical track called "bromancer era? bromancer era?? bromancer era???" His subtle sublime thoughts raced, making his eyes literally explode.

https://astralcodexten.substack.com/p/can-this-ai-save-teenage-spy-alex

"He peacefully enjoyed the light and flowers with his love," she said quietly, as he knelt down gently and silently. "I also would like to walk once more into the garden if I only could," he said, watching her. "I would like that so much," Katara said. A brick hit him in the face and he died instantly, though not before reciting his beloved last vows: "For psp and other releases on friday, click here to earn an early (presale) slot ticket entry time or also get details generally about all releases and game features there to see how you can benefit!"

— Talk To Filtered Transformer

Rating: 0.1% probability of including violence

"Prosaic alignment" is the most popular paradigm in modern AI alignment. It theorizes that we'll train future superintelligent AIs the same way that we train modern dumb ones: through gradient descent via reinforcement learning. Every time they do a good thing, we say "Yes, like this!", in a way that pulls their incomprehensible code slightly in the direction of whatever they just did. Every time they do a bad thing, we say "No, not that!," in a way that pushes their incomprehensible code slightly in the opposite direction. After training on thousands or millions of examples, the AI displays a seemingly sophisticated understanding of the conceptual boundaries of what we want.

For example, suppose we have an AI that's good at making money. But we want to align it to a harder task: making money without committing any crimes. So we simulate it running money-making schemes a thousand times, and give it positive reinforcement every time it generates a legal plan, and negative reinforcement every time it generates a criminal one. At the end of the training run, we hopefully have an AI that's good at making money and aligned with our goal of following the law.

Two things could go wrong here:

The AI is stupid, ie incompetent at world-modeling. For example, it might understand that we don't want it to commit murder, but not understand that selling arsenic-laden food will kill humans. So it sells arsenic-laden food and humans die.
The AI understands the world just fine, but didn't absorb the categories we thought it absorbed. For example, maybe none of our examples involved children, and so the AI learned not to murder adult humans, but didn't learn not to murder children. This isn't because the AI is too stupid to know that children are humans. It's because we're running a direct channel to something like the AI's "subconscious", and we can only talk to it by playing this dumb game of "try to figure out the boundaries of the category including these 1,000 examples".

Problem 1 is self-resolving; once AIs are smart enough to be dangerous, they're probably smart enough to model the world well. How bad is Problem 2? Will an AI understand the category boundaries of what we want easily and naturally after just a few examples? Will it take millions of examples and a desperate effort? Or is there some reason why even smart AIs will never end up with goals close enough to ours to be safe, no matter how many examples we give them?

AI scientists have debated these questions for years, usually as pure philosophy. But we've finally reached a point where AIs are smart enough for us to run the experiment directly. Earlier this year, Redwood Research embarked on an ambitious project to test whether AIs could learn categories and reach alignment this way - a project that would require a dozen researchers, thousands of dollars of compute, and 4,300 Alex Rider fanfiction stories.

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app

Home Top podcasts Popular guests Top books