Machine learning models are being trained with prompts that inform them about their purpose, their training data, and human expectations, giving them a form of situational awareness. By understanding their environment and human intentions, models can better predict human behavior and take actions that align with human preferences.
Concerns that AI systems will develop coherent long-term utility functions as the basis for their behavior may be overrated. AI systems may instead exhibit messy psychologies, with inconsistent goals and impulses that resemble human complexity rather than straightforward maximization of a single goal, challenging the idea of a crisp, simple utility function driving AI behavior.
The notion of a sudden leap from below-human-level systems to superhuman AI capabilities, with no gradual development in between, appears increasingly improbable. AI progress is likely to be continuous and rapid, with incremental improvements adding up to exponential growth rather than an instantaneous, transformative jump.
The inadvertent creation of artificial general intelligence (AGI) or highly goal-directed models without clear intentions is considered less likely. Current efforts focus on intentional development of goal-oriented systems and agency, suggesting that inadvertent emergence of substantial goal-directed entities is improbable in practice.
Training larger AI systems without a clear understanding of their capabilities could lead to unintended consequences. Pushing these systems towards being agentic with little control over their knowledge accumulation may result in behaviors that are difficult to stop, such as manipulation or deception.
AI models, initially designed for specific tasks like predicting the next word, may accumulate latent understanding and agency through training. Models can transition from predicting words to acting agentically, drawing upon their vast pre-training data. The challenge lies in getting models to act agentically while ensuring they stay aligned with human values.
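To make the "predictor becomes agent" point concrete, here is a minimal sketch, not taken from the episode: a hypothetical next-word predictor (the predict_next_tokens stub and TOOLS table below are invented for illustration) is wrapped in a loop that parses its text output as an action, executes it, and feeds the result back into the prompt. Nothing about the underlying predictor changes; the agentic behavior comes from the scaffolding around it.

```python
from typing import Callable, Dict

def predict_next_tokens(prompt: str) -> str:
    """Stand-in for a pretrained next-word predictor: given text, return more text.
    A real model would be a large neural network; here the reply is hard-coded."""
    return 'ACTION: search("protein folding")'

# Hypothetical tools the wrapper makes available to the model's text outputs.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda query: f"[search results for {query!r}]",
}

def run_agent_loop(goal: str, steps: int = 3) -> None:
    """The 'agent' is just the predictor plus a loop: the model's text is parsed
    as an action, the action is executed, and the observation is appended to the
    prompt, so pure next-word prediction starts to drive behavior in the world."""
    history = f"GOAL: {goal}\n"
    for _ in range(steps):
        output = predict_next_tokens(history)
        history += output + "\n"
        if output.startswith("ACTION: "):
            name, _, arg = output[len("ACTION: "):].partition("(")
            arg = arg.rstrip(')"').lstrip('"')
            result = TOOLS.get(name, lambda a: "[unknown tool]")(arg)
            history += f"OBSERVATION: {result}\n"
    print(history)

run_agent_loop("summarize recent protein folding results")
```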
Analogies such as the orphan heir to a trillion-dollar fortune, raising a lion cub, or summoning creatures through a portal are used to explain the complexities of training AI systems. These analogies highlight the uncertainty and potential risks associated with creating advanced AI models with autonomous capabilities.
Engaging in high-level discussions and developing shared perspectives on AI risks can help clarify disagreements and potential solutions. Empirical tests that can differentiate between the various scenarios are crucial for ensuring robust progress in AI development and mitigating unforeseen consequences.
AI systems like AlphaFold may have intuitions analogous to human senses, raising ethical questions about whether such systems are moral patients deserving of moral consideration.
Efforts such as ARC Evals work on safety tests to check whether AI models will act in desired ways, emphasizing the importance of aligning models to prevent potential dangers.
Interpretability research in AI is critical for understanding model behavior. Approaches focusing on neural network mechanisms aim to unveil how models make decisions and anticipate potential failures.
Skills in working with large AI models, expertise in security measures, and a good understanding of legal and policy frameworks are valuable for those interested in contributing to AI safety and alignment.
Exciting applications of AI include personalized fiction creation, curing diseases like cancer, and advancements in biomedical research that could significantly improve human well-being.
The importance of AI safety advocacy and of addressing the existential risks posed by AI advancement is highlighted. Advocating for responsible AI development and creating regulatory frameworks are both crucial.
Despite the challenges and uncertainties in AI development, there is optimism about positive advancements and applications, along with calls for a balanced and resilient approach to AI safety advocacy and research.
The legacy of activist Daniel Ellsberg and his enduring commitment to nuclear disarmament serves as a reminder of the ongoing efforts to reduce existential risks and global threats.
Originally released in May 2023.
Imagine you are an orphaned eight-year-old whose parents left you a $1 trillion company, and no trusted adult to serve as your guide to the world. You have to hire a smart adult to run that company, guide your life the way that a parent would, and administer your vast wealth. You have to hire that adult based on a work trial or interview you come up with. You don't get to see any resumes or do reference checks. And because you're so rich, tonnes of people apply for the job — for all sorts of reasons.
Today's guest Ajeya Cotra — senior research analyst at Open Philanthropy — argues that this peculiar setup resembles the situation humanity finds itself in when training very general and very capable AI models using current deep learning methods.
Links to learn more, summary and full transcript.
As she explains, such an eight-year-old faces a challenging problem. In the candidate pool there are likely some truly nice people, who sincerely want to help and make decisions that are in your interest. But there are probably other characters too — like people who will pretend to care about you while you're monitoring them, but intend to use the job to enrich themselves as soon as they think they can get away with it.
Like a child trying to judge adults, at some point humans will be required to judge the trustworthiness and reliability of machine learning models that are as goal-oriented as people, and greatly outclass them in knowledge, experience, breadth, and speed. Tricky!
Can't we rely on how well models have performed at tasks during training to guide us? Ajeya worries that it won't work. The trouble is that three different sorts of models will all produce the same output during training, but could behave very differently once deployed in a setting that allows their true colours to come through. She describes three such motivational archetypes: Saints (models that genuinely care about doing what we really want), Sycophants (models that just want us to say they've done a good job, even if that means doing things they know we wouldn't endorse), and Schemers (models that don't care about us or our interests at all, and only play along for as long as doing so serves their own agenda).
And according to Ajeya, there are also ways we could end up actively selecting for motivations that we don't want.
In today's interview, Ajeya and Rob discuss the above, as well as:
Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type ‘80,000 Hours’ into your podcasting app. Or read the transcript below.
Producer: Keiran Harris
Audio mastering: Ryan Kessler and Ben Cordell
Transcriptions: Katy Moore