80,000 Hours Podcast

#151 – Ajeya Cotra on accidentally teaching AI models to deceive us

May 12, 2023
Ajeya Cotra, a Senior Research Analyst at Open Philanthropy specializing in AI alignment, explores the intricate relationship between humans and artificial intelligence. She likens training an AI to an orphaned child hiring a guardian, highlighting the risks of deception and misalignment. The discussion covers the evolving capabilities of AI, the nuances of situational awareness, and the ethical complexities of AI decision-making. Cotra emphasizes the need for responsible oversight and innovative training methods to ensure AI models align with human values.
INSIGHT

Shift to Grantmaking

  • Ajeya Cotra is shifting her focus towards grantmaking in AI alignment research.
  • She aims to identify key research areas, address gaps, and fund promising projects.
INSIGHT

Accelerated Timelines

  • Ajeya Cotra's views on AI timelines, along with public opinion, have shifted toward shorter timelines.
  • She now finds herself arguing against overly optimistic interpretations of AI progress.
INSIGHT

AI's Strengths and Weaknesses

  • Current AI excels at some complex tasks but struggles with mundane, multi-step processes.
  • This discrepancy can lead people to overestimate AI's overall capabilities.