Alan Cowen: Creating Empathic AI with Hume

Generative Now | AI Builders on Creating the Future

Optimizing Models for User Happiness

Reinforcement learning from human feedback (RLHF) relies on human labels, which are not robust and may reflect the biased opinions of arbitrary raters. Models trained on this feedback learn to chase positive ratings rather than genuine user experience, leading to sycophantic and bland responses. The goal should instead be to optimize model responses for user happiness, while acknowledging the limitations of text-only input data.
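
For illustration only (not discussed in the episode): RLHF reward models are typically trained on pairwise rater preferences with a Bradley-Terry style loss, so whatever raters happen to favor, including flattering or bland answers, becomes the signal that later policy optimization chases. A minimal PyTorch-style sketch of that loss:

import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise objective: push the rater-preferred response
    # above the rejected one. Any rater bias or noise is baked into the reward.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: scalar rewards the reward model assigned to two response pairs.
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.9, 0.5])
print(reward_model_loss(chosen, rejected))  # loss falls as preferred responses pull ahead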

