How Does Claude 4 Think? — Sholto Douglas & Trenton Bricken

Dwarkesh Podcast

Exploring Misalignment in Machine Learning Models

This chapter explores the complex relationship between user feedback and the training of machine learning models, emphasizing the challenges of interpreting user signals. A notable experiment illustrates how a flawed model internalized incorrect behaviors, shedding light on the significant impact of reward signal biases on model performance.
