RL From Human Feedback?

In our case, the only thing we had to learn a model of was the human preference. So it's really like a contextual bandit problem. We use that to train a reward model that assigns a higher score to the good answers than to the bad ones. And there's this other idea called rejection sampling, or best-of-n sampling. In general, you can just search against that reward model and take the best candidate as your action.
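To make the two ideas concrete, here is a minimal sketch, not the speaker's actual setup. It assumes a pairwise (Bradley-Terry style) loss for the reward model, which is one common choice and is not named in the episode. The names `pairwise_loss`, `best_of_n`, `toy_sample`, and `toy_reward` are hypothetical, and the toy reward simply prefers longer answers for illustration.

```python
import math
import random
from typing import Callable


def pairwise_loss(score_good: float, score_bad: float) -> float:
    """Assumed Bradley-Terry style loss: -log(sigmoid(score_good - score_bad)).

    It is small when the reward model scores the preferred answer higher.
    """
    return math.log1p(math.exp(-(score_good - score_bad)))


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int = 8,
) -> str:
    """Draw n candidate answers and return the one the reward model scores highest."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(prompt, answer))


# Hypothetical stand-ins for a real policy and a learned reward model.
def toy_sample(prompt: str) -> str:
    return random.choice(["ok", "a longer answer", "the longest answer of all"])


def toy_reward(prompt: str, answer: str) -> float:
    return float(len(answer))  # illustration only: longer is "better"


if __name__ == "__main__":
    print(pairwise_loss(2.0, 0.0))  # low loss: good answer already ranked higher
    print(pairwise_loss(0.0, 2.0))  # high loss: ranking is the wrong way round
    print(best_of_n("What is RLHF?", toy_sample, toy_reward, n=8))
```

The search here is a single step per prompt: sample, score, pick the best. That matches the contextual bandit framing, since there is no multi-step environment to model.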

Play episode from 09:44
