
Constitutional AI: RLHF On Steroids
Astral Codex Ten Podcast
AI Alignment: Is It a Free Lunch?
The graph above shows that constitutionally trained models are less harmful at a given level of helpfulness. This technique isn't just cheaper and easier to control; it's also more effective. Other AI companies want AIs that balance these two goals, and end up along some Pareto frontier: they can't become more helpful without sacrificing harmlessness, or vice versa. Here Anthropic measures helpfulness and harmlessness through Elo, a scoring system originally from chess that measures which of two players wins more often. If AI number one has a helpfulness Elo of 200, and AI number two has a helpfulness Elo of 100, and you ask them both a question, AI number one should give the more helpful answer about 64% of the time.
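To make that 64% figure concrete, here is a minimal sketch of the standard Elo expected-score formula (the function name and example ratings are illustrative, not from the episode):

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the standard Elo model."""
    # A rating gap of 400 points corresponds to 10-to-1 odds.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A 100-point gap (helpfulness Elo 200 vs. 100) gives roughly 64%:
print(elo_win_probability(200, 100))  # ~0.64
```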