RLHF works by using two models: a reward model, and the actual model you're trying to train. The base model is used to generate multiple responses for the same initial prompt, and then human raters are asked, "Which of these do you prefer?" Those ratings are used to train the reward model. We could also use the thumbs up / thumbs down in the chat UI. I'm not sure whether I'm allowed to talk about whether we do that.
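The pairwise-preference idea described above is commonly trained with a Bradley-Terry-style loss: the reward model should score the human-preferred response higher than the rejected one. Here's a minimal sketch of that loss in plain Python; the function name and scores are illustrative assumptions, not anything from the episode.

```python
import math

def preference_loss(score_preferred, score_rejected):
    """Pairwise preference loss: -log sigmoid(r_preferred - r_rejected).

    The reward model is trained to minimize this, which pushes the score
    of the human-preferred response above the score of the rejected one.
    """
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correct ranking with a clear margin gives a small loss;
# ranking the pair the wrong way gives a large loss.
low = preference_loss(2.0, -1.0)
high = preference_loss(-1.0, 2.0)
```

In practice the scores come from a neural reward model run on each candidate response, and the trained reward model then supplies the reward signal for the RL step.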
In this Supper Club episode of Syntax, Wes and Scott talk with Andrey Mishchenko about working at OpenAI, getting started in machine learning, what RLHF is, the impact of plugins on ChatGPT, and what the future holds for developers in a world with ChatGPT tools.
Show Notes
××× SIIIIICK ××× PIIIICKS ×××
Shameless Plugs
Tweet us your tasty treats