21 - Interpretability for Engineers with Stephen Casper

AXRP - the AI X-risk Research Podcast

The Future of Interpretability

The Madry Lab at MIT does really, really cool interpretability work. At one point in time, I constructed a list, based on my knowledge of papers from the adversarial robustness and interpretability literature, of papers that seemed to demonstrate some sort of very engineering-relevant capability for model diagnostics or debugging. And this list had, I think, 21 or 22 papers on it. And for what it's worth, these papers did not come from people who are prototypical members of the AI safety or interpretability community.
