AXRP - the AI X-risk Research Podcast

38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future

Mar 1, 2025
In this discussion, David Duvenaud, a University of Toronto professor specializing in probabilistic deep learning who also works on AI safety at Anthropic, dives into the challenges of assessing whether AI models could sabotage human decisions. He shares insights on the complexities of sabotage evaluations and the strategies needed for effective oversight. The conversation then shifts to the societal impacts of a post-AGI world, reflecting on the implications for employment and the delicate balance between advancing AI and preserving human values.
INSIGHT

Sabotage Evaluation Challenges

  • Evaluating AI sabotage is difficult due to the complexity of defining potential attack strategies.
  • Narrow evaluations of specific capabilities are insufficient for building robust safety cases.
INSIGHT

Holistic Sabotage Evaluations

  • Holistic evaluations that simulate entire attack scenarios are more realistic than narrow evals.
  • These holistic evals incorporate oversight and the context of specific AI tasks.
INSIGHT

Key Sabotage Evaluation Areas

  • Four key areas of sabotage evaluations are sandbagging, undermining oversight, human decision sabotage, and code sabotage.
  • These areas represent capability thresholds where AI could potentially cause harm.