How to Do a Back Flip in a Language Model?

In a language model, you can think of each token, or word, as an action. And the reward you get depends on how good your output is compared to the other ones. So imagine you have ten different outputs for a prompt. The human labelers will compare the different outputs. Then if your output was always ranked as the best one, it will get a high reward, or a high score, from the labelers. Yeah, that's roughly it. I think it can be done in different ways.
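To make the comparison idea concrete, here is a minimal sketch of one way the labelers' pairwise judgments could be turned into a score per output. It assumes each judgment is recorded as a hypothetical `(winner, loser)` pair of output indices and uses a plain win rate as the reward. The function name `win_rate_scores` and the data layout are illustrative assumptions, not something described in the episode, and real systems typically train a reward model on such comparisons rather than using raw win rates.

```python
def win_rate_scores(num_outputs, comparisons):
    """Score each output by the fraction of its comparisons it won.

    comparisons: list of (winner_index, loser_index) pairs, a hypothetical
    record of which of two outputs a human labeler preferred.
    """
    wins = [0] * num_outputs
    appearances = [0] * num_outputs
    for winner, loser in comparisons:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    # Outputs that were never compared get a neutral score of 0.0 (an assumption).
    return [
        wins[i] / appearances[i] if appearances[i] else 0.0
        for i in range(num_outputs)
    ]


# Three outputs for one prompt: output 0 beats both others, output 1 beats output 2.
scores = win_rate_scores(3, [(0, 1), (0, 2), (1, 2)])
print(scores)
```

An output that wins every comparison ends up with the highest score, which matches the "always ranked as the best one" case described above.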
