The Thesis Review

[07] John Schulman - Optimizing Expectations: From Deep RL to Stochastic Computation Graphs

Sep 11, 2020
John Schulman, a Research Scientist and co-founder of OpenAI, co-leads its reinforcement learning efforts, focusing on algorithms that learn through trial and error. He shares insights on the evolution from TRPO to PPO and on the role of stochastic computation graphs in deriving gradient estimators. Schulman discusses the challenges of generalization in RL and how OpenAI Five used these techniques in its Dota 2 victories. The conversation also touches on the challenges of AI alignment and the value of integrating human intuition into machine learning.
INSIGHT

Stochastic Computation Graphs Simplify Gradients

  • Stochastic computation graphs provide a single formalism for optimization problems over graphs that mix stochastic and deterministic nodes, yielding general-purpose gradient estimators.
  • This standardization simplifies deriving gradients for complex models such as reinforcement learning policies (a minimal sketch of the underlying score-function estimator follows this list).
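
The simplest estimator that stochastic computation graphs generalize is the score-function (REINFORCE) gradient through a single random node. The sketch below is illustrative only, not from the episode; the Gaussian node and the toy cost f(x) = x**2 are assumptions chosen so the exact gradient (2 * mu) is easy to check.

import numpy as np

def score_function_gradient(mu, sigma=1.0, n_samples=100_000):
    # Stochastic node: x ~ N(mu, sigma^2); downstream cost f(x) = x**2.
    x = np.random.normal(mu, sigma, size=n_samples)
    cost = x ** 2
    # Score of the Gaussian with respect to its mean: d/dmu log p(x; mu, sigma).
    score = (x - mu) / sigma ** 2
    # Score-function estimator of d/dmu E[f(x)]: the average of f(x) * score.
    return np.mean(cost * score)

# For f(x) = x**2 and x ~ N(mu, 1), the exact gradient is 2 * mu.
print(score_function_gradient(mu=1.5))  # roughly 3.0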
ANECDOTE

TRPO's Origin Story

  • John Schulman combined ideas from conservative policy iteration and dynamic programming to create TRPO.
  • He found that constraining each policy update by a KL-divergence bound was key to stable reinforcement learning (the resulting trust-region problem is written out below).
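
For reference, the KL-constrained update described here is the trust-region problem from the TRPO paper; the notation below follows that paper rather than anything stated in the episode:

\[
\max_{\theta} \; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, A^{\pi_{\theta_{\text{old}}}}(s,a) \right]
\quad \text{subject to} \quad
\mathbb{E}_{s}\left[ D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\big) \right] \le \delta
\]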
ANECDOTE

OpenAI Five Triumph with PPO

  • OpenAI Five used PPO and self-play to train agents that eventually beat world-champion Dota 2 players (a sketch of PPO's clipped objective follows this list).
  • The project scaled successfully from one-versus-one matches to full five-versus-five games.
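
A minimal sketch of PPO's clipped-surrogate loss, the objective OpenAI Five optimized at scale. The function and argument names (ppo_clip_loss, clip_eps, and so on) are illustrative assumptions, not taken from the OpenAI Five codebase.

import numpy as np

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = np.exp(new_log_probs - old_log_probs)
    # Unclipped surrogate and the version with the ratio clipped to [1-eps, 1+eps].
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the elementwise minimum; returned negated so it can be minimized.
    return -np.mean(np.minimum(unclipped, clipped))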