21 - Interpretability for Engineers with Stephen Casper

AXRP - the AI X-risk Research Podcast

The Importance of Mechanistic Interpretability in AI

For every output of your neural network, there must be some trigger that would make it a good idea. To the extent that we're able to recognize failure, in theory we can always just filter out the bad outputs. In practice, though, I think we're going to run into challenges largely involving efficiency: our ability to recognize bad behavior may not be something we can easily tack onto a system after the fact. So we have to take these shortcuts and train models whose good behavior is more intrinsic, robust, and endogenous.

