Navigating Reinforcement Learning in Search Algorithms

This chapter explores the methodologies of reinforcement learning in search algorithms, focusing on techniques like Monte Carlo tree search and stable training methods for language models. Key discussions revolve around Proximal Policy Optimization and Trust Region Policy Optimization, examining hyperparameter tuning and the unpredictable outcomes of extensive training.

Episode notes

Finbarr Timbers is an AI researcher who writes Artificial Fintelligence, one of the technical AI blogs I’ve been recommending for a long time, and has experience at top AI labs including DeepMind and Midjourney. The goal of this interview was to do a few things:

* Revisit what reinforcement learning (RL) actually is, its origins, and its motivations.

* Contextualize the major breakthroughs of deep RL in the last decade, from DQN for Atari to AlphaZero to ChatGPT. How could we have seen the resurgence coming? (See the timeline below for the major events we cover.)

* Modern uses for RL, o1, RLHF, and the future of finetuning all ML models.

* Address some of the critiques like “RL doesn’t work yet.”

It was a fun one. Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.

Timeline of RL and what was happening at the time

In the last decade of deep RL, there have been a few phases.

* Era 1: Deep RL fundamentals, when modern algorithms were designed and proven.

* Era 2: Major projects, including AlphaZero, OpenAI Five, and all the projects that put RL on the map.

* Era 3: Slowdown, when DeepMind and OpenAI no longer had major RL projects and cultural relevance declined.

* Era 4: RLHF & widening success, RL’s new life post-ChatGPT.

The following events span these eras. The list is incomplete, but enough to inspire a conversation.

Early era: TD-Gammon, REINFORCE, etc.

2013: Deep Q Learning (Atari)