
Natasha Jaques 2

TalkRL: The Reinforcement Learning Podcast


Optimize for Reward, but Maximize the Reward Function

The human feedback is only valid for a limited amount of training. Like John was saying, if you train with that same reward model for too long, your performance ends up falling off at some point. I can say I observed this phenomenon in my own work on trying to optimize for reward but still do something probable under the data. You can definitely over-exploit the reward function: our agent with those rewards got overly positive, polite, and cheerful, so I always joke that it was the most Canadian dialogue agent you could train.
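
A minimal sketch of the objective being alluded to here (reward maximization regularized to stay probable under the data, as in KL-control or RLHF-style fine-tuning); the symbols are generic placeholders, not notation from the episode:

\[
\max_{\pi_\theta} \;\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta \, D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\|\, p_{\mathrm{data}}(\cdot \mid x) \big)
\]

Here \(r_\phi\) is the learned reward model, \(\pi_\theta\) is the policy being trained, and \(p_{\mathrm{data}}\) is the prior model fit to the data; the \(\beta\)-weighted KL term is what discourages the policy from drifting into outputs that over-exploit \(r_\phi\).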

