How Mechanistic Interpretability Can Shift Probability Mass Between Models
I think mechanistic interpretability is gonna tell us what's going on when we try to align models; I think it's basically gonna teach us about this. One way I can imagine concluding that things are very difficult is if mechanistic interpretability sort of shows us that, you know, problems tend to get moved around instead of being stamped out. It might inspire us, or give us insight into why problems are persistent, hard to eradicate, or keep cropping up. But the kind of thing that would really make us go, oh man, we can't solve this, is if we see it happening inside the x-ray. Because, yeah, because there are just too many assumptions. There