AXRP - the AI X-risk Research Podcast

21 - Interpretability for Engineers with Stephen Casper

May 2, 2023
INSIGHT

Interpretability for Bug Detection

  • Interpretability helps find and fix bugs in neural networks that test-set performance alone does not reveal.
  • It uniquely aids detection of insidious issues like Trojans and deceptive alignment triggers.

ADVICE

Emphasize Engineering in Interpretability

  • Focus interpretability research on engineering applications to maximize relevance.
  • Benchmarking and practical applications provide clearer progress signals than pure exploration.

INSIGHT

Interplay of Adversaries and Interpretability

  • Interpretability and adversarial research are strongly interconnected and mutually informative.
  • Adversarial examples themselves can serve as interpretability tools revealing model vulnerabilities.