The Importance of Reward Hacking
I feel like, the way we train our models right now, we're just training on all of ImageNet or, you know, all the Reddit posts with more than three karma, to create the Pile. I would say probably not: no matter how smart you are, if all you see is ImageNet, you really don't know much about people. And so the question that we were asking is: can you have a reward function that is safe to optimize but that isn't the true reward function? It could be good news for alignment if this were the case, because then you could say, oh yeah, learning everything about human values seems really hard.
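One way to make that question concrete is with a toy sketch (this is illustrative only, not from the episode; `true_reward`, `proxy_reward`, and the constants are all hypothetical assumptions): a proxy reward can agree with the true reward near the intended behaviour and still be unsafe to optimize, because pushing harder on the proxy pulls the true reward back down.

```python
import numpy as np

# Toy picture of reward misspecification (all functions hypothetical).
# An agent tunes a single policy parameter theta to maximize a *proxy*
# reward, a measurable stand-in for the true reward we actually care about.

def true_reward(theta):
    # The "real" objective, peaked at theta = 1. In practice we can't
    # write this down; that's the whole problem.
    return -(theta - 1.0) ** 2

def proxy_reward(theta):
    # A proxy that matches the true reward's shape but leaks an extra
    # linear bonus for larger theta: an exploitable misspecification.
    return -(theta - 1.0) ** 2 + 2.0 * theta

thetas = np.linspace(-1.0, 4.0, 501)
best_proxy = thetas[np.argmax(proxy_reward(thetas))]
best_true = thetas[np.argmax(true_reward(thetas))]

print(f"theta maximizing proxy reward: {best_proxy:.2f}")   # 2.00
print(f"theta maximizing true reward:  {best_true:.2f}")    # 1.00
print(f"true reward at proxy optimum:  {true_reward(best_proxy):.3f}")  # -1.000
print(f"true reward at true optimum:   {true_reward(best_true):.3f}")   #  0.000
```

Fully optimizing this proxy lands at theta = 2 with a true reward of -1, while the true optimum at theta = 1 scores 0. A "safe to optimize" proxy, in the sense asked in the quote, would be one whose optimum still sits close to the true optimum even under this kind of pressure.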