
Natasha Jaques 2
TalkRL: The Reinforcement Learning Podcast
Optimize for Reward, but Maximize the Reward Function
The human feedback is only valid for a limited amount of training. Like John was saying, if you train with that same reward model for too long, your performance ends up falling off at some point. So I can say I observed this phenomenon in my own work, trying to optimize for reward but still do something probable under the data. And you can definitely sort of over-exploit the reward function. Our agent with those rewards kind of got overly positive, polite, and cheerful, so I always joke that it was the most Canadian dialogue agent you could train.
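
The phrase "optimize for reward, but still do something probable under the data" describes a KL-regularized objective, where the learned reward is traded off against a penalty for drifting away from a prior trained on human data. Below is a minimal sketch of that shaped reward; the function name, arguments, and beta value are illustrative assumptions, not something stated in the episode.

def kl_penalized_reward(task_reward, logprob_policy, logprob_prior, beta=0.1):
    """Illustrative sketch (not the episode's exact formulation): shape the learned
    reward with a KL-style penalty so the agent is rewarded for the task but
    penalized for behavior that is improbable under the prior, e.g. a language
    model trained on human dialogue data.

    task_reward:    scalar from the learned reward model
    logprob_policy: log pi(action | state) under the current policy
    logprob_prior:  log p(action | state) under the data prior
    beta:           penalty weight; larger values keep the policy closer to the prior
    """
    kl_term = logprob_policy - logprob_prior  # per-sample estimate of KL(pi || prior)
    return task_reward - beta * kl_term

# A highly rewarded action that the data prior finds very unlikely gets discounted,
# which is what limits over-exploiting the reward model.
print(kl_penalized_reward(task_reward=2.0, logprob_policy=-1.0, logprob_prior=-6.0))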