
ChatGPT and InstructGPT: Aligning Language Models to Human Intention
Deep Papers
How Do You Train a Reward Model?
We hire a set of contractors to label data for us, and we essentially do an extra fine-tuning stage on top of the normal language-model pre-training stage. That involves three steps, which I think we'll get into a bit. But essentially the goal is to use reinforcement learning to produce outputs that are closer to the outputs a human would prefer, or rank highly.
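The reward-model step described above is typically trained on pairwise comparisons: a labeler ranks two model outputs for the same prompt, and the reward model learns to score the preferred one higher. Below is a minimal, hedged sketch of that pairwise preference loss under a Bradley-Terry model; the function name and scalar inputs are illustrative assumptions, not code from the episode.

```python
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen output outranks the
    rejected one: -log sigmoid(r_chosen - r_rejected).
    Computed in a numerically stable way for large |diff|."""
    diff = score_chosen - score_rejected
    if diff >= 0:
        # -log sigmoid(d) = log(1 + e^{-d})
        return math.log1p(math.exp(-diff))
    # For d < 0: log(1 + e^{-d}) = -d + log(1 + e^{d})
    return -diff + math.log1p(math.exp(diff))

# If the reward model already scores the preferred output higher,
# the loss is small; if it scores the rejected output higher, it is large.
low = preference_loss(2.0, -1.0)
high = preference_loss(-1.0, 2.0)
```

In practice the scores would come from a neural reward model and the loss would be averaged over a batch of labeled comparisons, but the scalar version above captures the training objective.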