How to Train a Reward Model for a Good Job
Data labelers give their preferences over different outputs, say, stories about frogs. We use that data to train one very large neural network, which we call a reward model. And then in a separate step, we... the reward model, you can think of it as almost like the score in a video game, or like a teacher. The reward model takes as input an instruction and an output, and it returns a number. That number tells you how good the output was. If the number is low, it means the story about frogs was a bad story.
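To make the idea concrete, here is a minimal sketch of a reward model trained from preference pairs. The transcript only says that labelers give preferences and that the model maps an instruction and an output to a number. Everything else is an assumption made for illustration: the pairwise logistic (Bradley-Terry style) loss, the bag-of-words features, the example stories, and the name `ToyRewardModel`. The model described in the episode is a very large neural network, not a linear scorer like this one.

```python
import math
from collections import Counter


class ToyRewardModel:
    """Hypothetical stand-in for a reward model: (instruction, output) -> number."""

    def __init__(self):
        self.weights = {}  # one learned weight per word

    def _features(self, instruction, output):
        # Assumption: score only the words of the output. A real reward
        # model would condition on the instruction as well.
        return Counter(output.lower().split())

    def score(self, instruction, output):
        feats = self._features(instruction, output)
        return sum(self.weights.get(word, 0.0) * count for word, count in feats.items())

    def train_step(self, instruction, preferred, rejected, lr=0.1):
        # Pairwise logistic loss: -log(sigmoid(score_preferred - score_rejected)).
        margin = self.score(instruction, preferred) - self.score(instruction, rejected)
        loss = math.log1p(math.exp(-margin))
        p_correct = 1.0 / (1.0 + math.exp(-margin))
        step = lr * (1.0 - p_correct)
        # Gradient descent: raise weights of preferred words, lower rejected ones.
        for word, count in self._features(instruction, preferred).items():
            self.weights[word] = self.weights.get(word, 0.0) + step * count
        for word, count in self._features(instruction, rejected).items():
            self.weights[word] = self.weights.get(word, 0.0) - step * count
        return loss


# One made-up labeler preference: the first story was preferred over the second.
instruction = "Write a story about frogs."
preferred = "The frog leaped across the pond and found a new friend."
rejected = "Frogs frogs frogs. The end."

model = ToyRewardModel()
losses = [model.train_step(instruction, preferred, rejected) for _ in range(50)]

print("preferred story score:", model.score(instruction, preferred))
print("rejected story score:", model.score(instruction, rejected))
```

After training, the preferred story gets the higher number, which is the "score in a video game" behavior described above: a low number signals a bad story about frogs.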