RLHF 201 - with Nathan Lambert of AI2 and Interconnects

Latent Space: The AI Engineer Podcast

Direct Preference Optimization as the latest post-PPO discovery in RL

Direct Preference Optimization (DPO) is a different class of algorithm that learns directly from preference data. Through some clever math, it derives an optimal policy under an implicit reward function. The reward in DPO is the difference between two log-probability ratios, unlike a classifier reward model, which is trained with a contrastive loss to output a scalar value. DPO is simpler to implement and work with, which has led to numerous successes in open source. Although it may be slightly more constrained in some ways and harder to formulate for specific cases, many more DPO-trained models are expected over the next six months.
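To make the "difference between two log-probability ratios" concrete, here is a minimal PyTorch sketch of the DPO objective. This is illustrative, not the episode's or the paper's reference implementation: the function name, beta value, and tensor shapes are assumptions, and each input is taken to be the summed log-probability of a completion under either the trained policy or a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-prob ratios of the policy
    # against the frozen reference model (log pi_theta - log pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The DPO "reward" is the difference of the two ratios; the loss
    # pushes the chosen completion's ratio above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: summed log-probs for a batch of two preference pairs
# (hypothetical numbers, just to show the call shape).
policy_chosen = torch.tensor([-12.0, -9.5])
policy_rejected = torch.tensor([-14.0, -9.0])
ref_chosen = torch.tensor([-13.0, -10.0])
ref_rejected = torch.tensor([-13.5, -9.8])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Note how no separate reward model appears anywhere in the loss: the policy and reference log-probs alone define the implicit reward, which is what makes DPO simpler to implement than a PPO pipeline with a trained classifier reward.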
