21 - Interpretability for Engineers with Stephen Casper

AXRP - the AI X-risk Research Podcast

The Future of Interpretability

The Madry Lab at MIT does really, really cool interpretability work. At one point in time, I constructed a list, based on my knowledge of papers from the adversarial robustness and interpretability literature, of papers that seemed to demonstrate some sort of very engineering-relevant capability for model diagnostics or debugging. And this list had, I think, 21 or 22 papers on it. And for what it's worth, these papers did not come from people who are prototypical members of the AI safety or interpretability community.
