Max Schwarzer

TalkRL: The Reinforcement Learning Podcast

CHAPTER

Exploration in BBF

There's an interesting line of work that's come out over the last year or so on a phenomenon called policy churn. Every time you do a gradient step in a value-based method, you change your policy dramatically more than you'd expect. For the classic DQN, after one gradient step, the greedy policy changes on about 10% of states. As a result, especially when you're training at really high replay ratios, your policy effectively looks stochastic, because every time you act in the environment it's different.
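
As a rough illustration (not from the episode), here is a minimal sketch of how one might measure policy churn: take one DQN-style gradient step and count the fraction of held-out states whose greedy action changes. The network size, batch, and TD targets below are hypothetical stand-ins, not BBF's actual setup.

# Minimal policy-churn measurement sketch (hypothetical setup).
import torch
import torch.nn as nn

torch.manual_seed(0)

n_states, state_dim, n_actions = 1000, 8, 6
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Held-out states on which we compare the greedy policy before and after.
eval_states = torch.randn(n_states, state_dim)

with torch.no_grad():
    actions_before = q_net(eval_states).argmax(dim=1)

# One DQN-style TD step on a random minibatch (targets faked for illustration;
# in real DQN this would be r + gamma * max_a' Q_target(s', a')).
batch = torch.randn(32, state_dim)
taken = torch.randint(n_actions, (32,))
td_targets = torch.randn(32)
q_pred = q_net(batch).gather(1, taken.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_pred, td_targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

with torch.no_grad():
    actions_after = q_net(eval_states).argmax(dim=1)

# Policy churn: fraction of states whose greedy action changed.
churn = (actions_before != actions_after).float().mean().item()
print(f"policy churn after one gradient step: {churn:.1%}")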
