How to Train a Reward Model for a Good Job

Data labelers give their preferences over different outputs, say, stories about frogs. We use that data to train one very large neural network, which we call a reward model. Then, in a separate step, we use the reward model, which you can think of as almost like the score in a video game, or like a teacher. The reward model takes as input an instruction and an output, and it returns a number. That number tells you how good the output was. If the number is low, it means the story about frogs was a bad story.
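To make the idea concrete, here is a minimal sketch in Python of a reward model trained from labeler preferences. It rests on several assumptions not stated in the clip: the features, the pairwise logistic loss, and every name (`featurize`, `reward`, `train`, `preferences`) are hypothetical, and a tiny linear scorer stands in for the very large neural network described above.

```python
import math

def featurize(instruction, output):
    """Hypothetical toy features; a real reward model learns its own representation."""
    instr_words = set(instruction.lower().split())
    out_words = output.lower().split()
    overlap = sum(1 for w in out_words if w in instr_words)
    # Feature 1: share of output words that appear in the instruction.
    # Feature 2: output length in words, capped at 20 and scaled to [0, 1].
    return [overlap / max(len(out_words), 1), min(len(out_words), 20) / 20.0]

def reward(weights, instruction, output):
    """Take an instruction and an output, return one number (higher is better)."""
    return sum(w * x for w, x in zip(weights, featurize(instruction, output)))

def train(preferences, steps=200, lr=0.5):
    """Fit weights so the labeler-preferred output scores higher than the rejected one."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        for instruction, chosen, rejected in preferences:
            fc = featurize(instruction, chosen)
            fr = featurize(instruction, rejected)
            margin = sum(w * (a - b) for w, a, b in zip(weights, fc, fr))
            # Assumed loss: -log(sigmoid(margin)); this is its gradient-descent step.
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            weights = [w + lr * scale * (a - b) for w, a, b in zip(weights, fc, fr)]
    return weights

# Made-up preference data: (instruction, preferred output, rejected output).
preferences = [
    ("write a story about frogs", "the frogs sang by the pond all night", "i like cars"),
    ("write a story about frogs", "a frog met a toad and they became friends", "no"),
]

weights = train(preferences)
for instruction, chosen, rejected in preferences:
    print(round(reward(weights, instruction, chosen), 3),
          round(reward(weights, instruction, rejected), 3))
```

After training, each preferred story gets a higher number than the rejected one, which is the sense in which the reward model acts like a score or a teacher.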
