The Gradient: Perspectives on AI

Riley Goodside: The Art and Craft of Prompt Engineering


The Evolution of Instruction Tuned Models

RLHF stands for reinforcement learning from human feedback. You start with an instruction-tuned model; the model generates many completions to a given prompt, and then these completions are ranked from best to worst by humans. These rankings are used to train a preference model that can imitate the preferences the humans provide, and be able to do that ranking automatically on further generations. So it sort of automates this process of the human providing feedback, and allows fine-tuning on a much greater scale of data.
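The preference-model step described above is often trained with a pairwise (Bradley-Terry style) loss: given a human ranking, the reward model is pushed to score the preferred completion above the rejected one. The sketch below is purely illustrative, not the actual training code of any system; the `reward` function, its length-based feature, and all names are hypothetical stand-ins for a learned reward model.

```python
import math

def pairwise_preference_loss(score_preferred, score_rejected):
    # Bradley-Terry style loss: -log sigmoid(r_preferred - r_rejected).
    # Minimizing it pushes the preferred completion's score above the
    # rejected completion's score.
    diff = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

def reward(weight, completion):
    # Toy "reward model": a single weight on a crude feature
    # (completion length). Purely illustrative, not a real scorer.
    return weight * len(completion)

# One human-ranked pair, and one gradient step on the toy model.
preferred, rejected = "a helpful, detailed answer", "meh"
weight, lr = 0.0, 0.01

# Gradient of the loss w.r.t. the weight:
# d(loss)/dw = -(1 - sigmoid(diff)) * (len(preferred) - len(rejected))
diff = reward(weight, preferred) - reward(weight, rejected)
sigmoid = 1.0 / (1.0 + math.exp(-diff))
grad = -(1.0 - sigmoid) * (len(preferred) - len(rejected))
weight -= lr * grad

loss_before = pairwise_preference_loss(
    reward(0.0, preferred), reward(0.0, rejected))
loss_after = pairwise_preference_loss(
    reward(weight, preferred), reward(weight, rejected))
```

After the step, `loss_after` is lower than `loss_before`: the toy model has learned to rank the preferred completion higher, which is the mechanism that lets a trained preference model stand in for the human rankers at scale.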

