How to Improve Mechanistic Interpretability
The more we can help the system understand how it works, the more likely we are to start some sort of self-improvement cycle. One argument for keeping discoveries in mechanistic interpretability contained is to basically not publish those discoveries. So, again, I have mostly problems and very few solutions for you.

What about the reinforcement learning from human feedback paradigm? Could that also perhaps turn out to increase capabilities here?

It's less likely to agree to be shut down verbally, but that seems to be the pattern, he says.