
John Schulman
TalkRL: The Reinforcement Learning Podcast
RL From Human Feedback?
In our case, the only thing we had to learn a model of was the human preference. So it's really like a contextual bandit problem. We use that to train a reward model that assigns a higher score to the good answers than the bad ones. And there's this other idea called rejection sampling or best-of-n sampling. In general, you can just search against that reward model and take the best one as your action.
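
The two ideas in this excerpt can be sketched in a few lines of Python. This is only an illustration, not the method used in the work being discussed: the logistic pairwise loss is an assumed (though common) way to make a reward model score good answers above bad ones, and `pairwise_loss`, `best_of_n`, `sample_fn`, `reward_fn`, `toy_sample`, and `toy_reward` are hypothetical names introduced for this example.

```python
import math
import random


def pairwise_loss(score_good, score_bad):
    """Logistic preference loss (an assumption, not taken from the episode).

    The loss shrinks as the preferred answer's score rises above the
    rejected answer's score, and grows when the order is reversed.
    """
    return math.log1p(math.exp(-(score_good - score_bad)))


def best_of_n(prompt, sample_fn, reward_fn, n=4):
    """Best-of-n (rejection) sampling.

    Draw n candidate answers for the prompt, score each one with the
    reward model, and return the highest-scoring candidate.
    """
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: reward_fn(prompt, answer))


# Toy stand-ins for a language model and a learned reward model.
_rng = random.Random(0)


def toy_sample(prompt):
    """Hypothetical sampler: picks one canned answer at random."""
    return _rng.choice(["short", "a longer answer", "the longest answer of all"])


def toy_reward(prompt, answer):
    """Hypothetical reward: longer answers score higher."""
    return float(len(answer))


if __name__ == "__main__":
    print(best_of_n("What is RLHF?", toy_sample, toy_reward, n=8))
```

Because each prompt is answered once and scored once, with no later state to account for, the action choice reduces to picking the best-scoring candidate, which is what makes the contextual bandit framing fit.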