The Reward Model Is a Discriminator in a Reward Learning Loop

The pretrained language model in OpenAI's GPT-3 is used as the policy in a reinforcement learning loop. The reward model is used to simulate human feedback and to fine-tune the policy.
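
To make the loop concrete, here is a minimal toy sketch in Python. It is not OpenAI's implementation. It assumes a policy that is just three logits over a fixed set of canned responses, a hand-written `reward_model` standing in for a model trained on human ratings, and a plain REINFORCE update in place of whatever optimizer a production system would use. All names (`RESPONSES`, `reward_model`, `fine_tune`) are hypothetical and chosen for illustration.

```python
import math
import random

# Hypothetical candidate outputs; a real policy generates free-form text.
RESPONSES = ["helpful answer", "rude answer", "off-topic answer"]


def reward_model(response: str) -> float:
    """Assumed stand-in for a learned reward model that simulates a human rating."""
    scores = {"helpful answer": 1.0, "rude answer": -1.0, "off-topic answer": 0.0}
    return scores[response]


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fine_tune(logits, steps=500, lr=0.1, seed=0):
    """Fine-tune the toy policy against the reward model with REINFORCE."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        # The policy samples a response.
        i = rng.choices(range(len(RESPONSES)), weights=probs)[0]
        # The reward model scores it in place of a human rater.
        r = reward_model(RESPONSES[i])
        # Gradient of log pi(i) w.r.t. logit j is (1[j == i] - probs[j]).
        for j in range(len(logits)):
            indicator = 1.0 if j == i else 0.0
            logits[j] += lr * r * (indicator - probs[j])
    return logits


# Start from a uniform "pretrained" policy and fine-tune it.
policy_logits = fine_tune([0.0, 0.0, 0.0])
final_probs = softmax(policy_logits)
```

After training, `final_probs` should put more probability on the response the reward model scores highest. The point of the sketch is only the division of roles: the policy proposes, the reward model scores, and the score drives the policy update.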
