How Do You Train a Reward Model?

We hire a set of contractors to label data for us, and we essentially do an extra fine-tuning stage on top of the normal language model pre-training stage. That involves three steps, which I think we'll get into in a bit. But essentially the goal is to use reinforcement learning to produce outcomes that are closer to the outputs that humans would prefer or rank highly.
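The speaker does not spell out the training objective here. As a rough illustration only, one common way to turn contractor rankings into a training signal is a pairwise preference loss: the reward model scores two outputs for the same prompt and is penalized when the output the labeler preferred does not score higher. The function name and the scalar-reward setup below are assumptions for this sketch, not details taken from the episode.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Hypothetical helper: -log(sigmoid(r_preferred - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    output higher than the rejected one, and large when it does not.
    """
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); branch on the sign of
    # diff so that exp() never receives a large positive argument.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Toy comparison: the labeler preferred output A over output B.
loss_agrees = pairwise_preference_loss(2.0, 0.5)     # model agrees with the labeler
loss_disagrees = pairwise_preference_loss(0.5, 2.0)  # model disagrees
print(loss_agrees < loss_disagrees)
```

In a real setup the two rewards would come from a neural network scoring full prompt and response pairs, and the loss would be averaged over many labeled comparisons before each gradient step.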

