Reward Collapse in Aligning Large Language Models
There are two different kinds of prompts that we give to our AI systems. One is an open-ended question, and then there's the second category, something like "What is the capital of Canada right now?", where there's a single clear answer. These RLHF schemes usually rely on having humans rank a bunch of different responses that a language model might give to a question like that. Some of those answers are going to be completely right and should get full reward, and some of them are just going to be laughably wrong. But it turns out that when you fail to account for that distinction, your system ends up treating all of these prompts in the same way, which leads to less effective models.
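To make that concrete: a reward model trained from human rankings typically uses the same prompt-agnostic pairwise objective for every prompt, whether the question is open-ended or has one correct answer. Below is a minimal sketch in Python/PyTorch of that kind of ranking loss; the function name and exact formulation are illustrative assumptions, not the paper's own code.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style ranking loss for one prompt (illustrative sketch).

    rewards: shape (n,), reward-model scores for n responses to the same
    prompt, ordered from best (index 0) to worst (index n-1) according to
    the human ranking.
    """
    n = rewards.shape[0]
    loss = torch.zeros(())
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Push every higher-ranked response above every lower-ranked one.
            # Note the loss depends only on the ranking, not on the prompt,
            # so an open-ended prompt and a single-correct-answer prompt are
            # optimized identically.
            loss = loss - F.logsigmoid(rewards[i] - rewards[j])
            count += 1
    return loss / max(count, 1)
```

Because the objective looks the same for every prompt, the trained reward distribution ends up the same shape regardless of prompt type, which is the collapse being described.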