Data Brew by Databricks

Reward Models | Data Brew | Episode 40

Mar 20, 2025
Brandon Cui, a Research Scientist at MosaicML and Databricks, specializes in AI model optimization and leads RLHF efforts. In this discussion, he explains how synthetic data and RLHF can fine-tune models for better outcomes. He explores techniques like Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) that improve model responses. Brandon also emphasizes the critical role of reward models in boosting performance on coding, math, and reasoning tasks, while highlighting the need for human oversight in AI training.
AI Snips
INSIGHT

Reward Model Purpose

  • Reward models excel at scoring generations against criteria such as helpfulness and safety.
  • They assess whether a generation meets the user's needs, enabling automatic quality assessment (see the scoring sketch after this list).
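A minimal sketch of how such a scorer might be used, assuming a reward model exposed through Hugging Face's AutoModelForSequenceClassification with a single scalar output; the checkpoint name and the score helper below are placeholders, not anything referenced in the episode.

```python
# Minimal sketch: using a reward model to score a generation (hypothetical setup).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "your-org/your-reward-model"  # placeholder checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=1 -> the classification head emits one scalar reward per input.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def score(prompt: str, response: str) -> float:
    """Return a scalar reward for how well `response` answers `prompt`."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        reward = model(**inputs).logits[0, 0]
    return reward.item()

# Higher scores mean the reward model judges the response more helpful/safe.
print(score("How do I reset my password?",
            "Open Settings, choose Account, then select Reset Password."))
```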
ADVICE

Training Reward Models

  • Train reward models on pairwise comparisons: present two responses to the same prompt and indicate which one is preferred.
  • Gather ample preference data so the trained model scores chosen responses above rejected ones (see the loss sketch after this list).
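A minimal sketch of the pairwise (Bradley-Terry style) objective this advice points to, written in PyTorch; the function name and toy numbers are illustrative, not from the episode. The loss pushes the model to score the chosen response above the rejected one.

```python
# Minimal sketch of a pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss over a batch of preference pairs.

    chosen_rewards / rejected_rewards: scalar scores the reward model assigned
    to the preferred and non-preferred response for each prompt, shape (batch,).
    The loss shrinks as the chosen score rises above the rejected score.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of three preference pairs.
chosen = torch.tensor([2.1, 0.7, 1.5])
rejected = torch.tensor([0.3, 0.9, -0.2])
print(pairwise_reward_loss(chosen, rejected))  # small when chosen > rejected
```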
INSIGHT

Preference Data Efficiency

  • Collecting preference data for reward models is simpler and cheaper than gathering instruction fine-tuning data.
  • It's easier for humans to compare two candidate responses than to write a good response from scratch.