How RLHF Works
RLHF works by using two models. One is called a reward model, and one is the actual model you're trying to train. The base model gets used to generate conversations over and over again for the same initial prompt. And then human raters are asked, which of these do you prefer? Those ratings feed into the reward model; they are used to train the reward model. We could also use the thumbs up, thumbs down in chat, in the UI. I'm not sure whether I'm allowed to talk about whether we do that or don't do that.
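To make the reward model step concrete, here is a minimal sketch of training on "which of these do you prefer?" ratings. It is an illustration only, not the speaker's actual setup. It assumes each response has already been turned into a small feature vector (the two features, roughly "helpfulness" and "rudeness", are hypothetical), uses a linear reward function, and uses a pairwise logistic (Bradley-Terry style) loss, a common choice that the transcript itself does not name.

```python
import math


def reward(weights, features):
    """Score a response: a plain dot product (hypothetical linear reward model)."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_model(preferences, dim, lr=0.1, epochs=200):
    """Fit weights so that rater-preferred responses score higher.

    preferences: list of (chosen_features, rejected_features) pairs,
    one pair per human comparison.
    """
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in preferences:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Modelled probability that the rater prefers `chosen`.
            p = 1.0 / (1.0 + math.exp(-margin))
            # Loss is -log(p); its derivative with respect to margin is (p - 1).
            grad_scale = p - 1.0
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights


# Made-up comparisons for a single prompt: (preferred response, other response).
# Feature order (an assumption for this sketch): [helpfulness, rudeness].
preferences = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.3], [0.1, 0.2]),
]

weights = train_reward_model(preferences, dim=2)

for chosen, rejected in preferences:
    print(round(reward(weights, chosen), 3), round(reward(weights, rejected), 3))
```

In a full RLHF loop the trained reward model would then score fresh outputs from the model being trained, and those scores would drive the policy update. That second stage is left out here because the passage only describes how the ratings feed the reward model.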