

21 - Interpretability for Engineers with Stephen Casper
Interpretability for Bug Detection
- Interpretability helps find and fix bugs in neural networks that test-set performance alone cannot reveal.
- It is particularly valuable for detecting insidious issues such as Trojans and deceptive-alignment triggers (see the sketch below).
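For readers unfamiliar with the term: a "Trojan" (or backdoor) is behaviour planted in a model so that it misbehaves only when a specific trigger appears in the input. Below is a minimal sketch of a data-poisoning Trojan, assuming a PyTorch image classifier with (N, C, H, W) inputs; the patch location, size, and poisoning fraction are illustrative choices, not the setup of the benchmark discussed in the episode.

```python
# Illustrative sketch (not the paper's benchmark): a data-poisoning Trojan
# stamps a small trigger patch onto a fraction of training images and
# relabels them, so the trained model misclassifies only when the trigger
# is present.
import torch

def add_trigger(images: torch.Tensor, value: float = 1.0) -> torch.Tensor:
    """Stamp a 3x3 patch in the bottom-right corner of a batch of images."""
    patched = images.clone()
    patched[..., -3:, -3:] = value  # assumes (N, C, H, W) layout
    return patched

def poison(images: torch.Tensor, labels: torch.Tensor,
           target_class: int, fraction: float = 0.05):
    """Return a batch where `fraction` of examples carry the trigger
    and are relabeled to `target_class`."""
    n_poison = max(1, int(fraction * len(images)))
    idx = torch.randperm(len(images))[:n_poison]
    images, labels = images.clone(), labels.clone()
    images[idx] = add_trigger(images[idx])
    labels[idx] = target_class
    return images, labels

# After training on poisoned data, the Trojan is invisible on a clean test
# set: predictions on clean_x match the true labels, while predictions on
# add_trigger(clean_x) collapse to target_class.
```

Because the model behaves normally on clean data, test-set accuracy never flags the problem; the question the episode explores is whether interpretability tools can.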
Emphasize Engineering in Interpretability
- Focusing interpretability research on engineering applications maximizes its practical relevance.
- Benchmarking and practical applications provide clearer progress signals than pure exploration.
Interplay of Adversaries and Interpretability
- Interpretability and adversarial research are strongly interconnected and mutually informative.
- Adversarial examples can themselves serve as interpretability tools, revealing model vulnerabilities (a minimal sketch follows below).
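As a concrete illustration of that last point, here is a minimal sketch of the fast gradient sign method (FGSM) in PyTorch. It is my own example rather than anything from the episode; `model`, `x`, `y`, and `eps` are placeholder names.

```python
# Illustrative sketch: one-step FGSM. The sign of the input gradient shows
# which input features the model's decision is most sensitive to, so the
# perturbation doubles as a crude interpretability map.
import torch
import torch.nn.functional as F

def fgsm(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor,
         eps: float = 0.03) -> torch.Tensor:
    """Return adversarial inputs x + eps * sign(d loss / d x)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    perturbation = eps * x.grad.sign()
    # Inspecting `perturbation` (or x.grad itself) highlights the features
    # the model relies on; if flipping a handful of them changes the
    # prediction, those features are doing most of the work.
    return (x + perturbation).detach()
```

If a tiny, human-imperceptible perturbation like this flips the prediction, that is both an attack and a piece of evidence about what the model has actually learned.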
Lots of people in the field of machine learning study 'interpretability', developing tools that they say give us useful information about neural networks. But how do we know if meaningful progress is actually being made? What should we want out of these tools? In this episode, I speak to Stephen Casper about these questions, as well as about a benchmark he's co-developed to evaluate whether interpretability tools can find 'Trojan horses' hidden inside neural nets.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Topics we discuss, and timestamps:
- 00:00:42 - Interpretability for engineers
- 00:00:42 - Why interpretability?
- 00:12:55 - Adversaries and interpretability
- 00:24:30 - Scaling interpretability
- 00:42:29 - Critiques of the AI safety interpretability community
- 00:56:10 - Deceptive alignment and interpretability
- 01:09:48 - Benchmarking Interpretability Tools (for Deep Neural Networks) (Using Trojan Discovery)
- 01:10:40 - Why Trojans?
- 01:14:53 - Which interpretability tools?
- 01:28:40 - Trojan generation
- 01:38:13 - Evaluation
- 01:46:07 - Interpretability for shaping policy
- 01:53:55 - Following Casper's work
The transcript: axrp.net/episode/2023/05/02/episode-21-interpretability-for-engineers-stephen-casper.html
Links for Casper:
- Personal website: stephencasper.com/
- Twitter: twitter.com/StephenLCasper
- Electronic mail: scasper [at] mit [dot] edu
Research we discuss:
- The Engineer's Interpretability Sequence: alignmentforum.org/s/a6ne2ve5uturEEQK7
- Benchmarking Interpretability Tools for Deep Neural Networks: arxiv.org/abs/2302.10894
- Adversarial Policies Beat Superhuman Go AIs: goattack.far.ai/
- Adversarial Examples Are Not Bugs, They Are Features: arxiv.org/abs/1905.02175
- Planting Undetectable Backdoors in Machine Learning Models: arxiv.org/abs/2204.06974
- Softmax Linear Units: transformer-circuits.pub/2022/solu/index.html
- Red-Teaming the Stable Diffusion Safety Filter: arxiv.org/abs/2210.04610
Episode art by Hamish Doodles: hamishdoodles.com