
RLHF 201 - with Nathan Lambert of AI2 and Interconnects

Latent Space: The AI Engineer Podcast

NOTE

Direct Preference Optimization as the latest post-PPO discovery in RL

Direct Preference Optimization (DPO) is a different class of algorithm that derives an optimal policy directly from preference data. Through some clever math, it extracts the optimal policy for an implicit reward function rather than training a separate reward model. The reward in DPO is the difference between two log-probability ratios (the policy versus a reference model, evaluated on the chosen and rejected responses), unlike the classifier-style reward model, which is trained with a contrastive loss to output a scalar score. DPO is simpler to implement and work with, which has led to numerous successes in open source. Although it may be somewhat more constrained and harder to formulate for certain cases, many more DPO-trained models are expected over the next six months.
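To make the "difference of two log-probability ratios" concrete, here is a minimal sketch of the DPO loss in PyTorch. It is illustrative rather than taken from the episode; the function name is hypothetical, and it assumes per-response log probabilities have already been summed over tokens for both the policy and the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Illustrative DPO loss (assumed helper, not from the episode).

    The implicit reward for each response is beta times the log-prob
    ratio between the policy and the frozen reference model; the loss
    pushes the chosen-vs-rejected reward margin through a sigmoid
    (Bradley-Terry style) objective.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The difference of the two log-prob ratios described above.
    logits = chosen_rewards - rejected_rewards
    # Maximize the probability that the chosen response is preferred.
    return -F.logsigmoid(logits).mean()
```

Because the reward is implicit in the policy and reference log probabilities, no separate reward model or PPO rollout loop is needed, which is a large part of why DPO has been easier to implement and adopt in open source.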
