TalkRL: The Reinforcement Learning Podcast

Episode: Natasha Jaques 2


The Challenges and Limitations of RLHF (05:02)

John Schulman: OpenAI is taking a different approach than we did in our 2019 paper on human feedback: they train this reward model. In contrast, the stuff I was doing in 2019 was offline RL. So I would use actual human ratings of a specific output and then train on that as, like, one example of a reward. But I didn't have this generalizable reward model that could be applied across more examples. And there's a good argument to be made that the reward-model training approach actually scales pretty well, because you can query it so many times.
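
To make the contrast concrete, here is a minimal sketch, assuming a toy numpy setup (none of this code is from the episode): approach 1 uses each human rating exactly once, as the reward for the specific output it scored, so no reward exists for outputs outside the fixed dataset; approach 2 fits a reward model on the same ratings, which can then score arbitrarily many fresh samples. The `featurize` function and the random dataset are hypothetical stand-ins for embedding real model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 8
def featurize(output_id: int) -> np.ndarray:
    # Hypothetical stand-in for embedding a sampled model output.
    return rng.standard_normal(DIM)

# A small fixed dataset of (output, human rating) pairs.
rated_outputs = [featurize(i) for i in range(100)]
human_ratings = [float(rng.uniform(-1, 1)) for _ in rated_outputs]

# --- Approach 1: offline RL on raw ratings (the 2019-style setup) ---
# Each rating is used once, as the reward for the exact output it scored.
# Outputs outside this fixed batch get no reward signal at all.
offline_batch = list(zip(rated_outputs, human_ratings))

# --- Approach 2: train a reward model that generalizes (OpenAI's setup) ---
# Fit a simple linear regressor on the same ratings; it can then be
# queried on arbitrarily many fresh samples during RL training.
X = np.stack(rated_outputs)
y = np.array(human_ratings)
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # reward model parameters

def reward_model(features: np.ndarray) -> float:
    return float(features @ w)

# The payoff: a never-rated sample still gets a reward signal.
fresh_sample = featurize(10_000)
print("predicted reward for an unrated sample:", reward_model(fresh_sample))
```

The generalization in the second approach is what makes the scaling argument work: the policy can be sampled far more times than humans could ever rate, and every sample still receives a reward.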

