21 - Interpretability for Engineers with Stephen Casper

AXRP - the AI X-risk Research Podcast

The Relationship Between Adversaries and Interpretability

There are very, very close relationships between this type of problem and the types of tools that the interpretability literature is hopefully able to produce. There's another connection too, because Trojans are quite a bit like deceptive alignment where, you know, deceptively aligned models are going to have these triggers for bad behavior. So the ability to characterize what models will do in a robust way on unseen, anomalous data is kind of one way of describing the problem of detecting Trojans or solving deceptive alignment.
