
John Schulman
TalkRL: The Reinforcement Learning Podcast
RL From Human Feedback?
In our case, the only thing we had to learn a model of was the human preference. So it's really like a contextual bandit problem. We use that to train a reward model that assigns a higher score to the good answers than the bad ones. And there's this other idea called rejection sampling, or best-of-n sampling. In general, you can just search against that reward model and take the best one as your action.
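The two ideas in this excerpt, a reward model trained so that preferred answers score higher, and best-of-n sampling against that model, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used in the work being discussed: the pairwise logistic loss is one common way to express "good answers score higher than bad ones", and the names `pairwise_preference_loss`, `best_of_n`, `sample_fn`, and `reward_fn` are hypothetical placeholders for a real policy and reward model.

```python
import math


def pairwise_preference_loss(score_good, score_bad):
    """Logistic loss on the score gap: -log(sigmoid(score_good - score_bad)).

    Assumption: the reward model emits one scalar score per answer.
    The loss is small when the preferred answer scores higher.
    """
    diff = score_good - score_bad
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def best_of_n(prompt, sample_fn, reward_fn, n=4):
    """Draw n candidate answers and keep the one the reward model scores highest.

    sample_fn(prompt) -> str and reward_fn(prompt, answer) -> float are
    hypothetical stand-ins for a policy and a trained reward model.
    """
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: reward_fn(prompt, answer))


if __name__ == "__main__":
    # Toy stand-ins: a fixed sequence of "samples" and a reward that
    # simply prefers longer answers (an arbitrary choice for the demo).
    canned = iter(["short", "a longer answer", "mid answer"])
    chosen = best_of_n(
        "What is RLHF?",
        sample_fn=lambda prompt: next(canned),
        reward_fn=lambda prompt, answer: float(len(answer)),
        n=3,
    )
    print(chosen)
    print(pairwise_preference_loss(2.0, 0.0))
```

Because each prompt is answered in a single step and then scored, there is no multi-step environment to model, which matches the contextual bandit framing in the excerpt.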