Is RL From Human Feedback Resulting in the Right Behavior?

Using machine learning together with human judgments, we can get models to do tasks that are fuzzier to define. And I think this paper is mentioned in your blog post as something that will not always produce the kind of alignment we want. So if we just train our models with human feedback, they might somehow overfit to the feedback we give them. But with [unclear], we might look at all the different cases where we can see whether there's some outer misalignment arising. Instead of having the thing just overfit to the feedback from the humans, we can see, oh, it's starting to have a high loss.
