
ChatGPT and InstructGPT: Aligning Language Models to Human Intention

Deep Papers


How to Train a Reward Model to Do a Good Job

Data labelers give their preferences over different outputs, say different stories about frogs. We use that data to train one very large neural network, which we call a reward model. You can think of the reward model almost like the score in a video game, or like a teacher. It takes an instruction and an output as input, and it returns a number. That number tells you how good the output was. If the number is low, it means the story about frogs was a bad story.
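As a minimal sketch of the idea described here (not the speakers' code), the snippet below shows a reward model that maps an (instruction, output) pair to a single scalar score and is trained on labeler comparisons with the pairwise preference loss used in InstructGPT, -log σ(r_chosen - r_rejected). PyTorch, the `TinyRewardModel` name, and the tiny bag-of-words encoder are illustrative stand-ins for the large language-model backbone mentioned in the episode.

```python
# Hypothetical sketch: a reward model scores "instruction + output" text
# with one scalar, trained on which of two outputs labelers preferred.
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    def __init__(self, vocab_size: int = 5000, dim: int = 64):
        super().__init__()
        # Stand-in encoder; in practice this would be a large language model.
        self.embed = nn.EmbeddingBag(vocab_size, dim)
        self.score_head = nn.Linear(dim, 1)  # scalar "how good is this output"

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) tokens of "instruction + output"
        return self.score_head(self.embed(token_ids)).squeeze(-1)

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Labelers said "chosen" was the better story; push its score above
    # the rejected one's: -log sigmoid(r_chosen - r_rejected).
    return -torch.nn.functional.logsigmoid(score_chosen - score_rejected).mean()

if __name__ == "__main__":
    model = TinyRewardModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Dummy token ids standing in for tokenized preferred/dispreferred outputs.
    chosen = torch.randint(0, 5000, (8, 32))
    rejected = torch.randint(0, 5000, (8, 32))

    loss = preference_loss(model(chosen), model(rejected))
    loss.backward()
    optimizer.step()
    print(f"pairwise preference loss: {loss.item():.4f}")
```

After training, the scalar score plays the role described above: a later step (reinforcement learning in InstructGPT) uses it as the "video game score" that the language model tries to maximize.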

