How to Improve Mechanistic Interpretability
The more we can help the system understand how it works, the more likely we are to start some sort of self-improvement cycle. One argument for keeping discoveries in mechanistic interpretability contained is to basically not publish those discoveries. So, again, I have mostly problems and very few solutions for you.

What about the reinforcement learning from human feedback paradigm? Could that also perhaps turn out to increase capabilities here?

It's less likely to agree to be shut down verbally, but that seems to be the pattern, he says.