AI Alignment as a Solvable Problem | Leopold Aschenbrenner & Richard Hanania

CSPI Podcast

The Truth Neuron: How It Determines Truth in Models

The hope for AI alignment is that, in the same way we'll be able to use AIs to automate AI capabilities research and build more powerful AI systems, maybe we can use AI to help automate interpretability research. And in fact, there's a paper that came out from OpenAI taking embryonic steps toward automating interpretability.
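
As a rough illustration of what automating interpretability can look like, here is a minimal sketch in the spirit of OpenAI's neuron-explanation work: show an "explainer" model text excerpts annotated with a neuron's per-token activations, then ask it to guess what concept the neuron fires on. The prompt format and the `query_llm` helper are illustrative assumptions, not the paper's actual code or API.

```python
# Sketch of an automated-interpretability loop, loosely following the idea in
# OpenAI's 2023 paper "Language models can explain neurons in language models".
# `query_llm` is a hypothetical stand-in for any chat-completion call.

from typing import Callable, List, Tuple

ActivationRecord = List[Tuple[str, float]]  # (token, neuron activation)


def format_record(record: ActivationRecord) -> str:
    """Render one excerpt with per-token activations for the explainer model."""
    return " ".join(f"{tok}({act:.1f})" for tok, act in record)


def build_explanation_prompt(records: List[ActivationRecord]) -> str:
    """Assemble a prompt asking the explainer to summarize the neuron's behavior."""
    excerpts = "\n".join(f"- {format_record(r)}" for r in records)
    return (
        "Each excerpt below shows tokens with a neuron's activation in parentheses.\n"
        f"{excerpts}\n"
        "In one sentence, what concept does this neuron respond to?"
    )


def explain_neuron(
    records: List[ActivationRecord], query_llm: Callable[[str], str]
) -> str:
    """Use a stronger model to propose a natural-language explanation."""
    return query_llm(build_explanation_prompt(records))


if __name__ == "__main__":
    # Toy activation records for a hypothetical "truth neuron".
    records = [
        [("the", 0.0), ("claim", 0.1), ("is", 0.0), ("true", 9.3)],
        [("that", 0.0), ("statement", 0.2), ("was", 0.0), ("false", 8.7)],
    ]
    # Stub in place of a real model call, so the sketch runs standalone.
    stub = lambda prompt: "Fires on truth-value words like 'true' and 'false'."
    print(explain_neuron(records, stub))
```

The paper's full pipeline goes a step further than this sketch: it scores each proposed explanation by having another model simulate the neuron's activations from the explanation alone and comparing against the real activations.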
