
Max Schwarzer

TalkRL: The Reinforcement Learning Podcast


Exploration in BBF

There's an interesting line of work that's come out over the last year or so on a phenomenon called policy churn. Every time you do a gradient step in a value-based method, you change your policy dramatically more than you would think. For the classic DQN, after one gradient step, your policy has changed on roughly 10% of states. And as a result, especially when you're training at really high replay ratios, what's actually happening is that your policy looks stochastic, because every time you're acting in the environment, it's different.
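To make that quantity concrete: policy churn here is the fraction of states whose greedy action flips after a single gradient step. Below is a minimal sketch of how one might measure it, assuming a small PyTorch Q-network; the network, batch sizes, and random regression targets are illustrative stand-ins, not the setup from the episode or the BBF paper.

```python
# Minimal sketch: measure policy churn after one DQN-style gradient step.
# All shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

obs_dim, n_actions, batch = 32, 6, 1024
q_net = nn.Sequential(
    nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions)
)
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Fixed set of probe states on which greedy actions are compared.
probe_states = torch.randn(batch, obs_dim)

def greedy_actions(states: torch.Tensor) -> torch.Tensor:
    """Greedy policy induced by the current Q-network."""
    with torch.no_grad():
        return q_net(states).argmax(dim=1)

before = greedy_actions(probe_states)

# One TD-style regression step on a random transition batch; the random
# targets stand in for r + gamma * max_a' Q_target(s', a').
states = torch.randn(batch, obs_dim)
actions = torch.randint(0, n_actions, (batch,))
targets = torch.randn(batch)

q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_sa, targets)
opt.zero_grad()
loss.backward()
opt.step()

after = greedy_actions(probe_states)

# Policy churn: fraction of probe states whose greedy action changed.
churn = (before != after).float().mean().item()
print(f"greedy policy changed on {churn:.1%} of probe states after one step")
```

Run in a loop over many gradient steps, a measurement like this makes the point from the episode visible: the cumulative action flips across high-replay-ratio training make the induced policy behave as if it were stochastic.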
