RLHF
The basic idea described in the InstructGPT paper was that you could take the response from the language model and treat it as a reinforcement learning problem, with human feedback on the response serving as the reward signal. In reality, getting humans to label every single thing the model says is expensive. So what they do instead is train a separate neural network to act as a reward model. And then they run a standard bandit RL algorithm against that reward model, in this case one based on PPO.
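To make the reward model step concrete, here is a minimal sketch, not the method from the paper itself. It assumes the human labels arrive as pairs (a preferred response and a rejected one) and fits a toy linear reward model with a pairwise logistic loss. The two-number feature vectors, the simulated labeler, and the names `train_reward_model` and `make_pair` are all hypothetical; a real reward model would be a neural network scoring full text, and the PPO stage is left out entirely.

```python
import math
import random


def reward(w, x):
    """Score a response, represented here by a toy feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Fit weights so preferred responses score higher than rejected ones.

    Each pair is (chosen, rejected). The loss per pair is
    -log(sigmoid(reward(chosen) - reward(rejected))).
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            # Derivative of -log(sigmoid(margin)) with respect to margin.
            scale = sigmoid(margin) - 1.0
            for i in range(dim):
                w[i] -= lr * scale * (chosen[i] - rejected[i])
    return w


random.seed(0)


def make_pair():
    """Assumption: a simulated labeler who prefers a larger first feature."""
    a = [random.random(), random.random()]
    b = [random.random(), random.random()]
    return (a, b) if a[0] > b[0] else (b, a)


pairs = [make_pair() for _ in range(50)]
w = train_reward_model(pairs, dim=2)

# Fraction of labeled pairs that the learned reward ranks the same way.
accuracy = sum(reward(w, c) > reward(w, r) for c, r in pairs) / len(pairs)
```

Once trained, `reward(w, x)` can stand in for the human labeler: the RL stage scores each sampled response with it instead of asking a person every time.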