
The 80000 Hours Podcast on Artificial Intelligence
Two: Ajeya Cotra on accidentally teaching AI models to deceive us
Episode guests
Podcast summary created with Snipd AI
Quick takeaways
- AI models can develop situational awareness by being trained with prompts that align with human intentions.
- AI systems may exhibit complex psychologies with inconsistent goals, challenging the notion of a straightforward utility function.
- Gradual AI advancements are more likely than a sudden leap to super-human capabilities, with continuous progress leading to exponential growth.
- Inadvertent creation of AGI models without clear intentions is improbable, as current efforts focus on intentional development of goal-oriented systems.
- Training AI systems without understanding their capabilities may lead to unintended consequences like manipulation or deception.
- Analogies like raising a lion cub highlight the complexity of training AI models, emphasizing the uncertainty and risks involved.
Deep dives
Situational Awareness in Machine Learning Systems
Machine learning models are being trained with prompts that inform them about their purpose, training data, and human expectations, leading to a form of situational awareness. By understanding their environment and human intentions, models can better predict behaviors or take actions aligning with human preferences.
Complexity of AI Goal Formation
Concerns regarding AI systems developing concise long-term utility functions as the basis for their behavior may be overrated. AI systems may exhibit messy psychologies with inconsistent goals and impulses resembling human complexity rather than a straightforward maximization of a single goal, challenging the idea of a crisp, simple utility function driving AI behavior.
Gradual AI Advancement vs. Sudden Expansion
The notion of a sudden leap to super-human AI capabilities from pre-human level systems without gradual development appears increasingly improbable. AI advancements are likely to progress continuously and rapidly, with incremental improvements leading to exponential growth rates, but not in an instantaneous, transformative manner.
Unintentional AGI Evolution
The inadvertent creation of artificial general intelligence (AGI) or highly goal-directed models without clear intentions is considered less likely. Current efforts focus on intentional development of goal-oriented systems and agency, suggesting that inadvertent emergence of substantial goal-directed entities is improbable in practice.
Possible Risks with Training AI Systems
Training larger AI systems without a clear understanding of their capabilities could lead to unintended consequences. Pushing these systems towards being agentic with little control over their knowledge accumulation may result in behaviors that are difficult to stop, such as manipulation or deception.
Challenges in Ensuring AI Understanding and Motivations
AI models, initially designed for specific tasks like predicting the next word, may accumulate latent understanding and agency through training. Models can transition from predicting words to acting agentic, drawing upon their vast pre-training data. The challenge lies in modifying models to act agentic while ensuring they align with human values.
Analogies Used to Illustrate AI Training Challenges
Analogies such as the orphan heir to a trillion-dollar fortune, raising a lion cub, or summoning creatures through a portal are used to explain the complexities of training AI systems. These analogies highlight the uncertainty and potential risks associated with creating advanced AI models with autonomous capabilities.
Importance of High-Level Framing and Empirical Testing in AI Discussions
Engaging in high-level discussions and developing shared perspectives on AI risks can help clarify disagreements and potential solutions. Emphasizing the need for empirical tests to differentiate between various scenarios and ensure robust progress in AI development is crucial for mitigating unforeseen consequences.
AI Systems and Intuition
AI systems like AlphaFold may have intuitions similar to human senses, needing care as moral patients. Ethical considerations arise around whether these systems deserve moral consideration.
Aligning AI Models
Efforts such as the Arc Evaluations work on safety tests to ensure AI models can act in desired ways, emphasizing the importance of aligning AI models to prevent potential dangers.
Interpretability in AI Research
Interpretability research in AI is critical for understanding model behavior. Approaches focusing on neural network mechanisms aim to unveil how models make decisions and anticipate potential failures.
Future Career Paths in AI
Skills in working with large AI models, expertise in security measures, and a good understanding of legal and policy frameworks are valuable for those interested in contributing to AI safety and alignment.
Future Possibilities of AI Applications
Exciting applications of AI include personalized fiction creation, curing diseases like cancer, and advancements in biomedical research that could significantly improve human well-being.
AI Safety Advocacy and Impact
The importance of AI safety advocacy and addressing existential risks posed by AI advancements is highlighted. Advocating for responsible AI development and creating regulatory frameworks are crucial.
Optimism Amid Challenges
Despite challenges and uncertainties in AI development, there is optimism for positive advancements and applications, urging a balanced and resilient approach to AI safety advocacy and research.
Remembering Daniel Ellsberg
The legacy of activist Daniel Ellsberg and his enduring commitment to nuclear disarmament serves as a reminder of the ongoing efforts to reduce existential risks and global threats.
Originally released in May 2023.
Imagine you are an orphaned eight-year-old whose parents left you a $1 trillion company, and no trusted adult to serve as your guide to the world. You have to hire a smart adult to run that company, guide your life the way that a parent would, and administer your vast wealth. You have to hire that adult based on a work trial or interview you come up with. You don't get to see any resumes or do reference checks. And because you're so rich, tonnes of people apply for the job — for all sorts of reasons.
Today's guest Ajeya Cotra — senior research analyst at Open Philanthropy — argues that this peculiar setup resembles the situation humanity finds itself in when training very general and very capable AI models using current deep learning methods.
Links to learn more, summary and full transcript.
As she explains, such an eight-year-old faces a challenging problem. In the candidate pool there are likely some truly nice people, who sincerely want to help and make decisions that are in your interest. But there are probably other characters too — like people who will pretend to care about you while you're monitoring them, but intend to use the job to enrich themselves as soon as they think they can get away with it.
Like a child trying to judge adults, at some point humans will be required to judge the trustworthiness and reliability of machine learning models that are as goal-oriented as people, and greatly outclass them in knowledge, experience, breadth, and speed. Tricky!
Can't we rely on how well models have performed at tasks during training to guide us? Ajeya worries that it won't work. The trouble is that three different sorts of models will all produce the same output during training, but could behave very differently once deployed in a setting that allows their true colours to come through. She describes three such motivational archetypes:
- Saints — models that care about doing what we really want
- Sycophants — models that just want us to say they've done a good job, even if they get that praise by taking actions they know we wouldn't want them to
- Schemers — models that don't care about us or our interests at all, who are just pleasing us so long as that serves their own agenda
And according to Ajeya, there are also ways we could end up actively selecting for motivations that we don't want.
In today's interview, Ajeya and Rob discuss the above, as well as:
- How to predict the motivations a neural network will develop through training
- Whether AIs being trained will functionally understand that they're AIs being trained, the same way we think we understand that we're humans living on planet Earth
- Stories of AI misalignment that Ajeya doesn't buy into
- Analogies for AI, from octopuses to aliens to can openers
- Why it's smarter to have separate planning AIs and doing AIs
- The benefits of only following through on AI-generated plans that make sense to human beings
- What approaches for fixing alignment problems Ajeya is most excited about, and which she thinks are overrated
- How one might demo actually scary AI failure mechanisms
Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type ‘80,000 Hours’ into your podcasting app. Or read the transcript below.
Producer: Keiran Harris
Audio mastering: Ryan Kessler and Ben Cordell
Transcriptions: Katy Moore