
Natasha Jaques 2
TalkRL: The Reinforcement Learning Podcast
The Challenges and Limitations of RLHF
Natasha Jaques: OpenAI is taking a different approach than we did in our 2019 paper on human feedback: they train this reward model. In contrast, the stuff I was doing in 2019 was offline RL. So I would use an actual human rating of a specific output and then train on that as one example of a reward. But I didn't have a generalizable reward model that could be applied across more examples. And there's a good argument to be made that the reward-model training approach actually scales pretty well, because you can sample it so many times.
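A rough sketch of the contrast being drawn here, not from the episode: toy feature vectors stand in for model outputs, and a simple least-squares fit stands in for a learned reward model. In the offline-RL setup the reward signal exists only for the outputs a human actually rated, while the trained reward model can score arbitrary new samples, which is what lets RLHF query it many times during training. All names and data below are illustrative assumptions.

```python
# Illustrative sketch only: contrasts (a) offline RL on raw per-output human
# ratings with (b) a reward model fit on those ratings and queried on new samples.
import numpy as np

rng = np.random.default_rng(0)

# Toy "outputs": feature vectors standing in for model responses a human rated.
rated_outputs = rng.normal(size=(50, 8))
true_w = rng.normal(size=8)                       # hidden preference direction (toy)
human_ratings = rated_outputs @ true_w + 0.1 * rng.normal(size=50)

# (a) Offline-RL view: the reward exists only for the outputs that were rated.
offline_rewards = {i: float(r) for i, r in enumerate(human_ratings)}

# (b) Reward-model view: fit a model so it can generalize beyond the rated set.
# A least-squares linear fit stands in for a learned reward model here.
w, *_ = np.linalg.lstsq(rated_outputs, human_ratings, rcond=None)

def reward_model(output_features: np.ndarray) -> np.ndarray:
    """Score arbitrary outputs, including ones no human ever rated."""
    return output_features @ w

# Fresh samples from the policy: the reward model can score all of them,
# which is why it can be "sampled so many times" during RL training.
new_samples = rng.normal(size=(1000, 8))
print("offline rewards cover", len(offline_rewards), "rated outputs")
print("reward model scores ", reward_model(new_samples).shape[0], "new samples")
```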