

38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future
Mar 1, 2025
In this discussion, David Duvenaud, a University of Toronto professor specializing in probabilistic deep learning who has also worked on AI safety at Anthropic, dives into the challenges of assessing whether AI models could sabotage human decisions. He shares insights on the complexities of sabotage evaluations and the oversight strategies needed to make them meaningful. The conversation then shifts to the societal impacts of a post-AGI world, reflecting on the implications for human employment and the delicate balance between advancing AI and preserving human values.
Sabotage Evaluation Challenges
- Evaluating AI sabotage is difficult due to the complexity of defining potential attack strategies.
- Narrow evaluations of specific capabilities are insufficient for building robust safety cases.
Holistic Sabotage Evaluations
- Holistic evaluations that simulate entire attack scenarios are more realistic than narrow evals.
- These holistic evals incorporate oversight and the context of specific AI tasks.
Key Sabotage Evaluation Areas
- Four key areas of sabotage evaluations are sandbagging, undermining oversight, human decision sabotage, and code sabotage (see the sketch after this list).
- Each area corresponds to a capability threshold beyond which an AI could potentially cause harm.
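To make the "capability threshold" idea concrete, here is a minimal, hypothetical sketch. The class names, scores, and thresholds below are invented for illustration and are not Anthropic's actual evaluation code; the point is simply that each sabotage area gets a measured capability score that is compared against a predefined threshold, and areas that cross it get flagged for further scrutiny.

```python
from dataclasses import dataclass
from enum import Enum


class SabotageArea(Enum):
    """The four sabotage evaluation areas mentioned in the episode."""
    SANDBAGGING = "sandbagging"
    UNDERMINING_OVERSIGHT = "undermining_oversight"
    HUMAN_DECISION_SABOTAGE = "human_decision_sabotage"
    CODE_SABOTAGE = "code_sabotage"


@dataclass
class EvalResult:
    area: SabotageArea
    score: float      # hypothetical capability score in [0, 1]
    threshold: float  # hypothetical threshold above which the area is flagged

    def exceeds_threshold(self) -> bool:
        return self.score >= self.threshold


def flag_concerning_areas(results: list[EvalResult]) -> list[SabotageArea]:
    """Return the areas whose measured capability crosses its threshold."""
    return [r.area for r in results if r.exceeds_threshold()]


if __name__ == "__main__":
    results = [
        EvalResult(SabotageArea.SANDBAGGING, score=0.2, threshold=0.5),
        EvalResult(SabotageArea.CODE_SABOTAGE, score=0.7, threshold=0.5),
    ]
    print(flag_concerning_areas(results))  # [SabotageArea.CODE_SABOTAGE]
```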