How Much Human Feedback Scales to What We're Doing at GPT?
ChatGPT was trained using a three-step process. The first step is to pretrain the model. The second step is to gather human preference data and use it to train a reward model, which takes in a prompt and a response and scores that response the way a human would, according to their preferences. The third and final step is to fine-tune a copy of the original language model against this trained reward model in a reinforcement learning loop.
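To make steps two and three more concrete, here is a minimal, stdlib-only sketch. It is not how ChatGPT is actually implemented. The transcript does not say which loss or RL algorithm is used, so this sketch assumes the pairwise (Bradley-Terry style) preference loss common in RLHF write-ups. It also swaps the neural reward model for a toy linear model over hand-made feature vectors. The function names, the feature vectors, and the penalty weight `beta` are all hypothetical. For step three it shows only the shaped reward an RL loop would optimize: the reward model's score minus a penalty for drifting away from the original model. The policy-update algorithm itself (for example PPO) is omitted.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z: float) -> float:
    """Stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Assumed preference loss: -log(sigmoid(r_chosen - r_rejected)).
    Small when the human-preferred response gets the higher score."""
    return softplus(-(r_chosen - r_rejected))


def score(w: list[float], features: list[float]) -> float:
    """Toy linear reward model (hypothetical stand-in for a neural network)."""
    return sum(wi * fi for wi, fi in zip(w, features))


def train_reward_model(pairs, dim: int, lr: float = 0.1, epochs: int = 200) -> list[float]:
    """Step 2: fit the reward model on (chosen, rejected) feature pairs
    collected from human preference comparisons, by gradient descent."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = score(w, diff)
            # Gradient of pairwise_loss w.r.t. w is -(1 - sigmoid(margin)) * diff,
            # so a descent step adds lr * (1 - sigmoid(margin)) * diff.
            step = lr * (1.0 - sigmoid(margin))
            w = [wi + step * di for wi, di in zip(w, diff)]
    return w


def rl_reward(reward: float, logp_policy: float, logp_reference: float,
              beta: float = 0.1) -> float:
    """Step 3 (reward shaping only): the reward-model score minus a penalty
    (a one-sample KL estimate) for drifting from the original model."""
    return reward - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # Hypothetical features per response: [helpfulness, rudeness].
    pairs = [
        ([1.0, 0.0], [0.0, 1.0]),  # (preferred response, rejected response)
        ([0.8, 0.1], [0.2, 0.9]),
    ]
    w = train_reward_model(pairs, dim=2)
    print("learned weights:", w)
    for chosen, rejected in pairs:
        print(score(w, chosen), ">", score(w, rejected))
    print("shaped reward:", rl_reward(score(w, [0.9, 0.0]), -0.5, -1.0))
```

The penalty term in `rl_reward` reflects the transcript's point that step three fine-tunes a *copy* of the original model. Keeping the original model around as a reference lets the RL loop discourage the copy from drifting too far while it chases high reward-model scores.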