
ChatGPT and InstructGPT: Aligning Language Models to Human Intention

Deep Papers

CHAPTER

How Do You Train a Reward Model?

We hire a set of contractors to label data for us, and we essentially do an extra fine-tuning stage on top of the normal language model pre-training stage. That involves three steps, which I think we'll get into a bit. But essentially, the goal is to use reinforcement learning to try to produce outputs that are closer to the ones a human would prefer, or rank highly.
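The reward-model step alluded to here is usually trained on pairwise comparisons from the labelers: given two model outputs for the same prompt, the model should score the preferred one higher. As an illustrative sketch (not the InstructGPT implementation; the function name and setup are hypothetical), the standard pairwise loss is the negative log-sigmoid of the score margin:

```python
import math

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise comparison loss: -log(sigmoid(r_chosen - r_rejected)).

    Trains a reward model so the response the labeler preferred
    scores higher than the one they rejected. Written in a
    numerically stable form for large |margin|.
    """
    margin = reward_chosen - reward_rejected
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss shrinks as the model's scores agree more strongly with the labeler:
low_loss = pairwise_reward_loss(2.0, 0.0)   # model already prefers the chosen response
high_loss = pairwise_reward_loss(0.0, 2.0)  # model prefers the rejected response
```

Minimizing this over many labeled comparisons yields a scalar reward signal, which the reinforcement-learning step then optimizes the language model against.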
