
Constitutional AI: RLHF On Steroids
Astral Codex Ten Podcast
The Problem With Constitutional AI
If you could really plug an AI's intellectual knowledge into its motivational system, and get it to be motivated by doing things humans want and approve of, then I think that would solve alignment. A superintelligence would understand ethics very well, so it would have very ethical behaviour.

How far does constitutional AI get us towards this goal? As currently designed, not very far. An already trained AI would go through some number of rounds of constitutional AI feedback, get answers that worked within some distribution, and then be deployed. This suffers from the same out-of-distribution problems as any other alignment method. What if someone scaled this method up? Even during deployment, whenever it planned an action, it prompted itself with,