The Relationship Between Adversaries and Interpretability

There are very close relationships between this type of problem and the kinds of tools that the interpretability literature will hopefully be able to produce. There's another connection too: Trojans are quite a bit like deceptive alignment, in that deceptively aligned models are going to have these triggers for bad behavior. So the ability to robustly characterize what models will do on unseen, anomalous data is one way of describing the problem of detecting Trojans or solving deceptive alignment.
