Natasha Jaques 2

TalkRL: The Reinforcement Learning Podcast

The Reward Model Isn't Perfect, Right?

The accuracy of the reward model on validation data, it's like in the 70s or something. So you really overfit to that reward model. It's not clear that it's going to be comprehensive enough to describe good outputs. But at the end of the day, your loss is still being propagated into your model by increasing or decreasing token-level probabilities.
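
To make the last point concrete, here is a minimal sketch of how a scalar reward signal can end up raising or lowering the probability of a sampled token. It assumes a plain REINFORCE-style update on a toy three-token vocabulary, which is not necessarily the exact algorithm discussed in the episode. The names `softmax`, `reinforce_step`, `advantage`, and `lr` are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a small list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, token, advantage, lr=0.5):
    # One gradient-ascent step on advantage * log p(token).
    # For a softmax policy: d log p(token) / d logit_j = 1[j == token] - p_j.
    probs = softmax(logits)
    grad_logp = [(1.0 if j == token else 0.0) - p for j, p in enumerate(probs)]
    return [x + lr * advantage * g for x, g in zip(logits, grad_logp)]

# Toy setup (assumption): uniform logits over a 3-token vocabulary.
logits = [0.0, 0.0, 0.0]
before = softmax(logits)[1]

# A positive signal from the reward model pushes the sampled token's probability up.
up = softmax(reinforce_step(logits, token=1, advantage=+1.0))[1]

# A negative signal pushes it down.
down = softmax(reinforce_step(logits, token=1, advantage=-1.0))[1]

print(before, up, down)
```

Whatever the reward model gets wrong, its score only reaches the policy through updates of this shape, so an imperfect reward is still translated into concrete shifts in per-token probabilities.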
