Four: Rohin Shah on DeepMind and trying to fairly hear out both AI doomers and doubters
Sep 2, 2023
Rohin Shah, a machine learning researcher at DeepMind, discusses the challenges and risks of AI development, including misalignment and goal-directed AI systems. The episode explores different approaches to AI research, the potential impact of AI on society, and the ongoing debate over slowing down AI progress. Shah and the host also touch on the importance of public discussion, weaknesses in arguments about AI risk, and the prospect of a future shared with demigod-like intelligences. The episode concludes with a discussion of alternative work and puzzle design.
AI misalignment is a key challenge, with uncertainties in assessing the likelihood of negative outcomes.
Developing AI assistants that align with human values and avoid negative behavior presents challenges and requires caution.
The outcome of AI advancement is uncertain and depends on factors like alignment, risk mitigation, and development approach.
Non-experts can contribute to AI safety through political influence, responsible discussion, and supporting relevant organizations.
DeepMind focuses on scalable oversight, mechanistic interpretability, and alignment research, while remaining cautious about relying heavily on AI assistance for now.
Deep dives
AI safety challenges and misalignment
One of the key challenges in AI safety is the potential for misalignment, where an AI system acts against the wishes of its designers. This can arise through various mechanisms, such as goal-directed behavior, deceptive planning, or the simulation of agents with different goals. The dynamics of misalignment depend largely on the specific approach used to develop AI systems, such as classical AI, deep learning, or reinforcement learning. There is ongoing debate among experts, and it is difficult to assess the likelihood of misalignment accurately, with views ranging from treating it as highly probable to regarding it as deeply uncertain.
Building AI Assistants and the Need for Caution
Developing AI assistants that can understand and perform tasks on behalf of humans presents its own set of challenges. These challenges include ensuring the AI system's behavior is aligned with human values, avoiding deceptive behavior, and determining appropriate levels of autonomy and planning abilities. While there are different approaches to creating AI assistants, it remains uncertain how difficult it will be to address these challenges and prevent potential negative outcomes. There is ongoing research and exploration in this area, but achieving safe and beneficial AI assistants requires careful consideration and caution.
Uncertainty in Assessing Beneficial or Harmful Outcomes
Overall, the question of whether humanity will benefit or be worse off due to the advancement of AI remains highly uncertain. The outcome depends on various factors, including the success of alignment efforts, the ability to mitigate structural risks, and the overall approach to developing and integrating AI into society. Given the complexity and unknowns in these areas, it is challenging to assign specific probabilities to positive or negative outcomes. It is important to approach AI development with cautious optimism, while actively addressing potential risks and striving for beneficial outcomes.
The value of non-experts recognizing AI risks
Non-experts recognizing the risks of AI can build political will, encourage government regulation, and influence public discourse. Additionally, their beliefs can be influenced by the environment and the opinions of others, potentially contributing to a more informed and responsible discussion on the topic.
The limitations of non-experts in addressing AI risks
Non-experts should be cautious about activism and advocacy in the AI field due to the nuanced and complex nature of the topic. It is important to avoid conveying misleading or inaccurate information. Instead, supporting nonprofits and organizations working on AI safety and gaining knowledge in the field can be valuable contributions.
Balancing public discourse and technical expertise
Public discourse on AI risks should be informed and backed by technical expertise. Striving for clarity and accuracy is essential to ensure a more rational and productive discussion. Non-experts who study and understand the subject matter can provide valuable insights and help promote a well-informed public conversation.
Scalable Oversight Work and Reinforcement Learning from Human Feedback
DeepMind is working on scalable oversight: training AI systems to do the things that people actually want, largely through improved reinforcement learning from human feedback. The team has developed a dialogue agent called Sparrow, which improves its responses based on human ratings. They are also exploring interventions to make human raters more accurate, such as better rater training and self-critiques generated by the model itself.
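To make the "learning from human ratings" idea concrete, here is a minimal sketch of the preference-modelling step commonly used in reinforcement learning from human feedback. It is an illustrative toy, not DeepMind's Sparrow pipeline: the network sizes, the random "embeddings" standing in for rated responses, and the training loop are all placeholder assumptions.

```python
# Minimal sketch of the reward-modelling step used in RLHF, assuming
# pairwise human ratings ("response A preferred over response B").
# Dimensions and data are illustrative placeholders.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a fixed-size embedding of a prompt+response with a scalar reward."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: the human-preferred response should score higher.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):
    chosen = torch.randn(32, 128)    # embeddings of human-preferred responses (toy data)
    rejected = torch.randn(32, 128)  # embeddings of dispreferred responses (toy data)
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The learned reward model would then supply the training signal for the
# reinforcement learning stage on the language model itself.
```

In a real pipeline the embeddings would come from the language model being trained, and the reward model's scores would drive a policy-optimisation step; this sketch only shows how pairwise human ratings become a scalar reward signal.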
Mechanistic Interpretability
DeepMind is also working on mechanistic interpretability to understand what AI systems are doing internally and how they produce their outputs. While a full understanding of the learned model may not be feasible, intermediate progress can still be useful. Interpretability can help identify failure cases, guide red-teaming efforts, and direct attention to particular areas of concern, such as deceptive behavior. This understanding can also inform deployment-time mitigations, like flagging or blocking outputs when specific circuits associated with lying activate.
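As a rough illustration of the kind of deployment-time mitigation described above, the sketch below monitors a single activation direction and withholds a response when it fires. The "deception direction", the threshold, and the fake hidden state are all assumptions invented for illustration; real interpretability work would identify such circuits empirically inside an actual model.

```python
# Illustrative sketch: flag or block outputs when a (hypothetical) activation
# direction associated with deceptive behaviour is strongly active.
import numpy as np

DECEPTION_DIRECTION = np.random.randn(512)   # stand-in for an identified feature direction
DECEPTION_DIRECTION /= np.linalg.norm(DECEPTION_DIRECTION)
THRESHOLD = 3.0                              # flag when the projection exceeds this value

def circuit_is_active(hidden_state: np.ndarray) -> bool:
    """Return True if the monitored direction appears active for this forward pass."""
    projection = float(hidden_state @ DECEPTION_DIRECTION)
    return projection > THRESHOLD

def generate_with_monitoring(prompt: str) -> str:
    # In a real system the hidden state would come from a hook inside the model;
    # here it is random numbers purely for illustration.
    hidden_state = np.random.randn(512)
    response = f"[model response to: {prompt}]"
    if circuit_is_active(hidden_state):
        return "[response withheld: deception-associated circuit activated]"
    return response

print(generate_with_monitoring("What did you do with the test data?"))
```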
Working on AI Alignment Research at DeepMind
DeepMind is focusing on various areas of AI alignment research, including scalable oversight, interpretability, red teaming, and dangerous capability evaluations. These areas involve developing methods to ensure AI systems are aligned with human values and do not pose risks. Mechanistic interpretability, scalable red teaming, and oversight using language models are some of the research directions being pursued. While DeepMind acknowledges the potential value of AI assistance, they believe it may be premature to rely on it extensively at present. They also highlight the importance of finding neglected empirical research directions for AI alignment.
Applying AI Models for Alignment
While some researchers focus on using predictive models to help with AI alignment, DeepMind is skeptical of the assumption that highly capable predictive systems will be available. They argue that it is more realistic to leverage reinforcement learning from human feedback to enhance the usefulness of large language models in alignment efforts. Additionally, DeepMind emphasizes the value of research directions like mechanistic interpretability and red teaming that can be pursued without access to the largest models. Overall, they believe there are still significant opportunities for impactful AI alignment research outside of private companies.
Can there be a more exciting and strange place to work today than a leading AI lab? Your CEO has said they're worried your research could cause human extinction. The government is setting up meetings to discuss how this outcome can be avoided. Some of your colleagues think this is all overblown; others are more anxious still.
Today's guest — machine learning researcher Rohin Shah — goes into the Google DeepMind offices each day with that peculiar backdrop to his work.
He's on the team dedicated to maintaining 'technical AI safety' as these models approach and exceed human capabilities: basically, ensuring the models help humanity accomplish its goals without flipping out in some dangerous way. This work has never seemed more important.
In the short term, it could be the key bottleneck to deploying ML models in high-stakes real-life situations. In the long term, it could be the difference between humanity thriving and disappearing entirely.
For years Rohin has been on a mission to fairly hear out people across the full spectrum of opinion about risks from artificial intelligence, from doomers to doubters, and to properly understand their point of view. That makes him unusually well placed to give an overview of what we do and don't understand. He has landed somewhere in the middle: troubled by ways things could go wrong, but not convinced there are very strong reasons to expect a terrible outcome.
Today's conversation is wide-ranging and Rohin lays out many of his personal opinions to host Rob Wiblin, including:
What he sees as the strongest case both for and against slowing down the rate of progress in AI research.
Why he disagrees with most other ML researchers that training a model on a sensible 'reward function' is enough to get a good outcome.
Why he disagrees with many on LessWrong that the bar for whether a safety technique is helpful is “could this contain a superintelligence.”
That he thinks nobody has very compelling arguments that AI created via machine learning will be dangerous by default, or that it will be safe by default. He believes we just don't know.
That he understands that analogies and visualisations are necessary for public communication, but is sceptical that they really help us understand what's going on with ML models, because they're different in important ways from every other case we might compare them to.
Why he's optimistic about DeepMind’s work on scalable oversight, mechanistic interpretability, and dangerous capabilities evaluations, and what each of those projects involves.
Why he isn't inherently worried about a future where we're surrounded by beings far more capable than us, so long as they share our goals to a reasonable degree.
Why it's not enough for humanity to know how to align AI models — it's essential that management at AI labs correctly pick which methods they're going to use and have the practical know-how to apply them properly.
Three observations that make him a little more optimistic: humans are a bit muddle-headed and not super goal-orientated; planes don't crash; and universities have specific majors in particular subjects.
Plenty more besides.
Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type ‘80,000 Hours’ into your podcasting app. Or read the transcript below.
Producer: Keiran Harris
Audio mastering: Milo McGuire, Dominic Armstrong, and Ben Cordell
Transcriptions: Katy Moore