Connor Leahy on the State of AI and Alignment Research

Future of Life Institute Podcast

CHAPTER

Mechanistic Interpretability: A Paradigm of AI Safety

The fact that we survived has nothing to do with any security method; it has to do with the systems being secured not being existentially dangerous. If those systems had been existentially dangerous AGI, then yes, I expect we would be dead. It's only because of the limited capabilities of these systems that they can be hacked, and have been hacked, and so on. Let's take another paradigm of AI safety: mechanistic interpretability. This is about understanding what a black-box machine learning system is doing. Is this a hopeful paradigm, in your opinion? "I think it's definitely something worth working on," he says.
