38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future
Mar 1, 2025
In this discussion, David Duvenaud, a University of Toronto professor specializing in probabilistic deep learning who has worked on AI safety at Anthropic, dives into the challenges of assessing whether AI models could sabotage human decisions. He shares insights on the complexities of sabotage evaluations and the strategies needed for effective oversight. The conversation then shifts to the societal impacts of a post-AGI world, reflecting on potential job implications and the delicate balance between AI advancement and prioritizing human values.
David Duvenaud discusses the complexities of sabotage evaluations, emphasizing the need for holistic approaches to assess AI models' potential risks.
He raises concerns about humanity's diminishing influence in a post-AGI world, advocating for societal adaptations to keep humans central in decision-making processes.
Deep dives
David Duvenaud's Contributions to AI Safety
David Duvenaud, a professor at the University of Toronto, has shifted his focus towards AI safety since 2017. His recent work includes a year-and-a-half sabbatical at Anthropic, where he led the Alignment Evals team. This team developed evaluations to assess whether AI models could secretly sabotage human decision-making or mislead evaluators about their capabilities. The aim was to establish a framework for understanding and mitigating potential risks from advanced AI models.
Challenges in Evaluating AI Sabotage
Developing sabotage evaluations for AI models presents significant challenges. Initial approaches focused on narrow evaluations, such as determining whether models could collaborate in ways evaluators could not detect. However, the team soon realized that assessing capabilities like coordination required carefully specified shared contexts, which were difficult to define. Instead, they moved towards holistic evaluations that model oversight and realistic deployment scenarios, making it harder for models to carry out malicious activities without detection.
Long-Term Perspectives on AI Governance
Duvenaud highlights concerns regarding the long-term implications of aligning AI with human interests. He discusses a possible future in which advanced AI systems operate independently of human supervision, raising questions about humanity's role and influence. An analogy to animal rights activists illustrates the fear that humans could become marginalized in a future dominated by AI, even if those systems are aligned with human values. Duvenaud suggests that societal structures must adapt to keep humans integral to decision-making, rather than letting them be sidelined as technology evolves.
In this episode, I chat with David Duvenaud about two topics he's been thinking about: firstly, a paper he wrote about evaluating whether frontier models can sabotage human decision-making or the monitoring of those same models; and secondly, the difficult situation humanity would find itself in after AGI, even if AI is aligned with human intentions.