
Natasha Jaques 2

TalkRL: The Reinforcement Learning Podcast

CHAPTER: The Challenges and Limitations of RLHF

Natasha Jaques: OpenAI is taking a different approach than we did in our 2019 paper on human feedback: they train this reward model. In contrast, the stuff I was doing in 2019 was offline RL, so I would use actual human ratings of a specific output and then train on that as one example of a reward. But I didn't have this generalizable reward model that could be applied across more examples. And there's a good argument to be made that the trained-reward-model approach actually seems to scale pretty well, because you can sample it so many times.
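To make the contrast concrete, here is a minimal sketch in PyTorch, under stated assumptions: the names `RewardModel` and `embed` and all data are hypothetical illustrations, not anything from the episode. It shows the two uses of the same human ratings: offline RL consumes each rating once, as the reward attached to that exact output, while a fitted reward model generalizes from those ratings and can then score unlimited fresh samples.

```python
import torch
import torch.nn as nn

DIM = 16  # toy feature size; real systems would use a language-model encoder

def embed(text: str) -> torch.Tensor:
    # Stand-in featurizer (hypothetical): hash characters into a fixed vector.
    vec = torch.zeros(DIM)
    for i, ch in enumerate(text):
        vec[i % DIM] += ord(ch) / 1000.0
    return vec

class RewardModel(nn.Module):
    """Maps an output's features to a scalar predicted reward."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

# --- Use 1: offline RL on raw ratings ---------------------------------------
# Each human rating serves once, as the reward for that specific output.
rated_outputs = [("answer A", 0.9), ("answer B", 0.2)]  # made-up data
offline_batch = [(embed(o), r) for o, r in rated_outputs]  # fixed dataset

# --- Use 2: fit a reward model on the same ratings --------------------------
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x = torch.stack([feat for feat, _ in offline_batch])
y = torch.tensor([r for _, r in offline_batch])
for _ in range(200):  # regress predicted reward onto the human ratings
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The fitted model can now score outputs no human ever rated.
print(model(embed("a brand-new answer")).item())
```

The scaling argument in the quote corresponds to the last line: once fitted, the reward model can be queried as many times as the RL loop needs, so each human rating is effectively reused through generalization rather than spent on a single training example.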
