
AXRP - the AI X-risk Research Podcast
38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future
 Mar 1, 2025 
In this discussion, David Duvenaud, a University of Toronto professor specializing in probabilistic deep learning who has also worked on AI safety at Anthropic, digs into the challenge of assessing whether AI models could sabotage human decisions and oversight. He explains why sabotage evaluations are hard to design and what kinds of oversight strategies they need to support. The conversation then turns to the societal impacts of a post-AGI world, including potential effects on jobs and the tension between advancing AI and preserving human influence and values.
AI Snips
Sabotage Evaluation Challenges
- Evaluating AI sabotage is difficult due to the complexity of defining potential attack strategies.
- Narrow evaluations of specific capabilities are insufficient for building robust safety cases.
Holistic Sabotage Evaluations
- Holistic evaluations that simulate entire attack scenarios are more realistic than narrow evals.
- These holistic evals incorporate oversight and the context of specific AI tasks.
Key Sabotage Evaluation Areas
- Four key areas of sabotage evaluations are sandbagging, undermining oversight, human decision sabotage, and code sabotage.
- These areas represent capability thresholds where AI could potentially cause harm.
