Analyzing the Evaluation Process of AI Plans and Human Feedback in Reinforcement Learning

Exploring the comparison between outcomes-based and plan-making AI systems, emphasizing on the role of human feedback and reward models in training AI systems.

Play episode from 01:27:38

chevron_right

Transcript

chevron_right

Transcript

Episode notes

Originally released in May 2023.

Imagine you are an orphaned eight-year-old whose parents left you a $1 trillion company, and no trusted adult to serve as your guide to the world. You have to hire a smart adult to run that company, guide your life the way that a parent would, and administer your vast wealth. You have to hire that adult based on a work trial or interview you come up with. You don't get to see any resumes or do reference checks. And because you're so rich, tonnes of people apply for the job — for all sorts of reasons.

Today's guest Ajeya Cotra — senior research analyst at Open Philanthropy — argues that this peculiar setup resembles the situation humanity finds itself in when training very general and very capable AI models using current deep learning methods.

Links to learn more, summary and full transcript.

As she explains, such an eight-year-old faces a challenging problem. In the candidate pool there are likely some truly nice people, who sincerely want to help and make decisions that are in your interest. But there are probably other characters too — like people who will pretend to care about you while you're monitoring them, but intend to use the job to enrich themselves as soon as they think they can get away with it.

Like a child trying to judge adults, at some point humans will be required to judge the trustworthiness and reliability of machine learning models that are as goal-oriented as people, and greatly outclass them in knowledge, experience, breadth, and speed. Tricky!

Can't we rely on how well models have performed at tasks during training to guide us? Ajeya worries that it won't work. The trouble is that three different sorts of models will all produce the same output during training, but could behave very differently once deployed in a setting that allows their true colours to come through. She describes three such motivational archetypes:

Saints — models that care about doing what we really want
Sycophants — models that just want us to say they've done a good job, even if they get that praise by taking actions they know we wouldn't want them to
Schemers — models that don't care about us or our interests at all, who are just pleasing us so long as that serves their own agenda

And according to Ajeya, there are also ways we could end up actively selecting for motivations that we don't want.

In today's interview, Ajeya and Rob discuss the above, as well as:

How to predict the motivations a neural network will develop through training
Whether AIs being trained will functionally understand that they're AIs being trained, the same way we think we understand that we're humans living on planet Earth
Stories of AI misalignment that Ajeya doesn't buy into
Analogies for AI, from octopuses to aliens to can openers
Why it's smarter to have separate planning AIs and doing AIs
The benefits of only following through on AI-generated plans that make sense to human beings
What approaches for fixing alignment problems Ajeya is most excited about, and which she thinks are overrated
How one might demo actually scary AI failure mechanisms

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type ‘80,000 Hours’ into your podcasting app. Or read the transcript below.

Producer: Keiran Harris

Audio mastering: Ryan Kessler and Ben Cordell

Transcriptions: Katy Moore

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app

Home Top podcasts Popular guests Top books