The Importance of Reward Hacking
I feel like, the way we train our models right now, we're just training on all of ImageNet or, you know, all the Reddit posts with more than three karma, to create the Pile. I would say probably not: no matter how smart you are, if all you see is ImageNet, you really don't know much about people. And so the question that we were asking is: can you have a reward function that is safe to optimize but that isn't the true reward function? It could be good news for alignment if this were the case, because then you could say, oh yeah, learning everything about human values seems really hard.
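One way to make that question concrete is with a toy sketch (this is illustrative only, not from the episode; `true_reward`, `proxy_reward`, and the constants are all hypothetical assumptions): a proxy reward can agree with the true reward near the intended behaviour and still be unsafe to optimize, because pushing harder on the proxy pulls the true reward back down.

```python
import numpy as np

# Toy picture of reward misspecification (all functions hypothetical).
# An agent tunes a single policy parameter theta to maximize a *proxy*
# reward, a measurable stand-in for the true reward we actually care about.

def true_reward(theta):
    # The "real" objective, peaked at theta = 1. In practice we can't
    # write this down; that's the whole problem.
    return -(theta - 1.0) ** 2

def proxy_reward(theta):
    # A proxy that matches the true reward's shape but leaks an extra
    # linear bonus for larger theta: an exploitable misspecification.
    return -(theta - 1.0) ** 2 + 2.0 * theta

thetas = np.linspace(-1.0, 4.0, 501)
best_proxy = thetas[np.argmax(proxy_reward(thetas))]
best_true = thetas[np.argmax(true_reward(thetas))]

print(f"theta maximizing proxy reward: {best_proxy:.2f}")   # 2.00
print(f"theta maximizing true reward:  {best_true:.2f}")    # 1.00
print(f"true reward at proxy optimum:  {true_reward(best_proxy):.3f}")  # -1.000
print(f"true reward at true optimum:   {true_reward(best_true):.3f}")   #  0.000
```

Fully optimizing this proxy lands at theta = 2 with a true reward of -1, while the true optimum at theta = 1 scores 0. A "safe to optimize" proxy, in the sense asked in the quote, would be one whose optimum still sits close to the true optimum even under this kind of pressure.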