TD Learning

Q-learning is a paradigm shift because you don't have to do all this dynamic programming or policy gradients. You just let the deep neural network figure out what it means to optimize long-term reward. In some results, the Decision Transformer gets about the same as TD learning, and in some cases it does better. How is it doing this? I was a little surprised that TD learning is represented by CQL. Would there be other algorithms that might do better to represent TD learning here? Many people are working on that.
