
Constitutional AI: RLHF On Steroids
Astral Codex Ten Podcast
AI Alignment: Is It a Free Lunch?
The graph above shows that constitutionally trained models are less harmful at a given level of helpfulness. This technique isn't just cheaper and easier to control; it's also more effective. Other AI companies want AIs that balance these two goals, and end up along some Pareto frontier: they can't become more helpful without sacrificing harmlessness, or vice versa. Here Anthropic measures helpfulness and harmlessness through Elo, a scoring system originally from chess that measures which of two players wins more often. If AI number one has a helpfulness Elo of 200, and AI number two has a helpfulness Elo of 100, and you ask them both a question, AI number one should give the more helpful answer about 64% of the time.
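To make that 64% figure concrete, here is a minimal sketch of the standard Elo expected-score formula (the function name and example ratings are illustrative, not from the episode):

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the standard Elo model."""
    # A rating gap of 400 points corresponds to 10-to-1 odds.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A 100-point gap (helpfulness Elo 200 vs. 100) gives roughly 64%:
print(elo_win_probability(200, 100))  # ~0.64
```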