The Dark Side of Interpretability Research

Is it perhaps dangerous to experiment with gathering empirical data on these behaviors in AI systems? I'm thinking of something analogous to gain-of-function research on viruses. We nevertheless need to find ways to understand these behaviors, elicit them safely, and learn how they can be addressed.
