
"Against Almost Every Theory of Impact of Interpretability" by Charbel-Raphaël
LessWrong (Curated & Popular)
Adversarial Attacks and Model Debugging (01:13:46)
This chapter covers identifying copy-paste attacks on images, relaxed adversarial training, and debugging models to improve classification accuracy in safety-critical scenarios. It discusses model-rewriting tools for hardening classifiers against adversarial examples, examines the difficulty of achieving robust defenses against out-of-distribution attacks, and raises open questions about AI deception and interpretability.


