
ChatGPT and InstructGPT: Aligning Language Models to Human Intention
Deep Papers
How to Train a Reward Model for a Good Job
Data labelers give their preferences over different outputs, say, stories about frogs. We use that data to train one very large neural network, which we call a reward model. You can think of the reward model almost like the score in a video game, or like a teacher. The reward model takes an instruction and an output as input, and it returns a number. That number tells you how good the output was. If the number is low, it means the story about frogs was a bad story.
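A minimal sketch of what that means in code, assuming a toy transformer encoder and made-up tokenization rather than the actual InstructGPT architecture: the reward model maps an (instruction, output) pair to a single scalar, and the labelers' preferences are used with a pairwise ranking loss so the preferred output gets the higher score.

```python
# Hypothetical reward-model sketch (not the InstructGPT implementation):
# score(instruction + output) -> one number, higher = better.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score_head = nn.Linear(d_model, 1)  # scalar "how good is this output"

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) = instruction tokens followed by output tokens
        h = self.encoder(self.embed(token_ids))
        pooled = h.mean(dim=1)                        # pool over the sequence
        return self.score_head(pooled).squeeze(-1)    # one number per example

model = RewardModel()

# Stand-ins for a tokenized instruction + two candidate frog stories,
# where labelers preferred the first story over the second.
preferred = torch.randint(0, 32000, (1, 64))
rejected  = torch.randint(0, 32000, (1, 64))

# Pairwise preference loss: push the preferred story's score above the rejected one's.
loss = -F.logsigmoid(model(preferred) - model(rejected)).mean()
loss.backward()
print(loss.item())
```

The key design point the transcript describes is the output shape: no matter how long the instruction or the story is, the reward model compresses it down to one number, which is what later acts like the video-game score during reinforcement learning.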