
AXRP - the AI X-risk Research Podcast

38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future

Mar 1, 2025
In this discussion, David Duvenaud, a University of Toronto professor specializing in probabilistic deep learning who recently worked on AI safety at Anthropic, dives into the challenges of assessing whether AI models could sabotage human decisions. He shares insights on the complexities of sabotage evaluations and the strategies needed for effective oversight. The conversation then shifts to the societal impacts of a post-AGI world, reflecting on potential job implications and the delicate balance between advancing AI and preserving human values.

Quick takeaways

  • David Duvenaud discusses the complexities of sabotage evaluations, emphasizing the need for holistic approaches to assess AI models' potential risks.
  • He raises concerns about humanity's diminishing influence in a post-AGI world, advocating for societal adaptations to keep humans central in decision-making processes.

Deep dives

David Duvenaud's Contributions to AI Safety

David Duvenaud, a professor at the University of Toronto, has focused on AI safety since 2017. His recent work includes a year-and-a-half sabbatical at Anthropic, where he led a team known as Alignment Evals. This team developed evaluations to assess whether AI models could secretly sabotage human decision-making or mislead evaluators about their capabilities. The aim was to establish a framework for understanding and mitigating the risks posed by advanced AI models.
