John Schulman

TalkRL: The Reinforcement Learning Podcast

CHAPTER

RL From Human Feedback?

In our case, the only thing we had to learn a model of was the human preference, so it's really like a contextual bandit problem. We use that to train a reward model that assigns a higher score to the good answers than to the bad ones. And there's this other idea called rejection sampling, or best-of-n sampling: in general, you can just search against that reward model and take the best one as your action.
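
The two ideas in this excerpt can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the work being discussed. The pairwise logistic loss is one common way to make a reward model score good answers above bad ones, and its use here is an assumption. The names `pairwise_loss`, `best_of_n`, `toy_sample`, and `toy_reward` are hypothetical stand-ins for a trained reward model and a language-model sampler.

```python
import itertools
import math
from typing import Callable


def pairwise_loss(score_good: float, score_bad: float) -> float:
    """Logistic loss on a preference pair (an assumed, commonly used choice).

    Computes log(1 + exp(score_bad - score_good)) in a numerically stable way.
    The loss is small when the preferred answer gets the higher score.
    """
    x = score_bad - score_good
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int,
) -> str:
    """Draw n candidate answers and return the one the reward model scores highest."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(prompt, answer))


# Hypothetical stand-ins so the sketch runs without a real model.
_POOL = itertools.cycle(
    ["I don't know.", "Paris.", "Paris is the capital of France."]
)


def toy_sample(prompt: str) -> str:
    """Pretend sampler: cycles through a fixed pool of answers."""
    return next(_POOL)


def toy_reward(prompt: str, answer: str) -> float:
    """Pretend reward model: simply prefers longer answers."""
    return float(len(answer))


if __name__ == "__main__":
    question = "What is the capital of France?"
    print(best_of_n(question, toy_sample, toy_reward, n=3))
    print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

Because each prompt is answered in a single step and scored once, there is no state transition to model, which is why the setup resembles a contextual bandit: the prompt is the context, the answer is the action, and the reward model supplies the payoff.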
