
Roman Yampolskiy on Objections to AI Safety
Future of Life Institute Podcast
How to Improve Mechanistic Interpretability
The more we can help the system understand how it works, the more likely it is to start some sort of self-improvement cycle. An argument for containing discoveries in mechanistic interpretability is to simply not publish them. So again, I have mostly problems and very few solutions for you.

What about the reinforcement learning from human feedback paradigm? Could that also perhaps turn out to increase capabilities here?

It's less likely to agree to be shut down, verbally at least, but that seems to be the pattern.