

Teaching LLMs to Self-Reflect with Reinforcement Learning with Maohao Shen - #726
Apr 8, 2025
Maohao Shen, a PhD student at MIT specializing in AI reliability, discusses his groundbreaking work on 'Satori.' He reveals how it enhances language model reasoning through reinforcement learning, enabling self-reflection and exploration. The podcast dives into the innovative Chain-of-Action-Thought approach, which guides models in complex reasoning tasks. Maohao also explains the two-stage training process, including format tuning and self-corrective techniques. The conversation highlights Satori’s impressive performance and its potential to redefine AI reasoning capabilities.
AI Snips
Autoregressive Search
- Autoregressive search in language models aims to mimic human-like reasoning with self-reflection and exploration.
- Unlike traditional test-time search methods, it relies on neither external guidance nor multiple models (see the sketch below).
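To make that contrast concrete, here is a minimal Python sketch of external best-of-N search versus a single-model autoregressive search pass. The functions `policy_generate` and `verifier_score` are hypothetical stand-ins for a language model and an external reward model, not anything from Satori itself.

```python
# Minimal sketch contrasting external test-time search with single-pass
# "autoregressive search". policy_generate and verifier_score are hypothetical
# stand-ins for a language model and an external reward model.

def policy_generate(prompt: str, seed: int = 0) -> str:
    """Placeholder LLM call; returns one candidate solution string."""
    return f"candidate-{seed} for: {prompt}"

def verifier_score(solution: str) -> float:
    """Placeholder external verifier, needed only by the best-of-N baseline."""
    return float(len(solution) % 7)

def best_of_n_search(prompt: str, n: int = 8) -> str:
    # Traditional test-time search: sample N candidates, rank them with a second model.
    candidates = [policy_generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=verifier_score)

def autoregressive_search(prompt: str) -> str:
    # Satori-style idea: one model, one generation pass; reflection and exploration
    # happen inside the generated sequence (via the COAT tokens described next),
    # so no external guidance or second model is needed at inference time.
    return policy_generate(prompt + "\nThink step by step, reflect, and explore.")

if __name__ == "__main__":
    print(best_of_n_search("What is 17 * 24?"))
    print(autoregressive_search("What is 17 * 24?"))
```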
Chain-of-Action-Thought (COAT)
- Chain-of-Action-Thought (COAT) uses special tokens (continue, reflect, explore) to guide the model's reasoning actions.
- This lets the model self-correct, explore alternatives, and simulate human-like problem-solving, unlike traditional Chain-of-Thought; a toy trace is sketched below.
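Here is a toy illustration of what a COAT-formatted trace might look like. Only the three meta-actions (continue, reflect, explore) come from the episode; the exact token strings and the `build_coat_trace` helper are illustrative assumptions, not the paper's implementation.

```python
# Toy COAT-style trace builder. The token surface forms below are assumed for
# illustration; only the three meta-actions come from the discussion above.
CONTINUE, REFLECT, EXPLORE = "<|continue|>", "<|reflect|>", "<|explore|>"

def build_coat_trace(steps: list[tuple[str, str]]) -> str:
    """Interleave meta-action tokens with reasoning text into one training string."""
    token_for = {"continue": CONTINUE, "reflect": REFLECT, "explore": EXPLORE}
    return "\n".join(f"{token_for[action]} {text}" for action, text in steps)

trace = build_coat_trace([
    ("continue", "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 418."),
    ("reflect",  "Check: 340 + 68 is 408, not 418, so the last step is wrong."),
    ("explore",  "Alternative route: 17 * 24 = 17 * 25 - 17 = 425 - 17 = 408."),
])
print(trace)
```

The reflect step is where the model can catch its own mistake, and the explore step retries with a different decomposition, which is the self-corrective behavior the conversation highlights.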
Training with COAT
- Training a language model with COAT involves a two-stage process: format tuning and reinforcement learning.
- Format tuning familiarizes the model with the special action tokens using a multi-agent data synthesis framework that simulates collaborative problem-solving; a toy version of the full pipeline is sketched below.
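Below is a high-level, toy sketch of how the two stages could fit together: supervised format tuning on synthesized COAT demonstrations, then reinforcement learning from a correctness reward. `ToyModel`, the demo synthesizer, and the reward shaping are placeholder assumptions, not the Satori codebase.

```python
import random

# Toy sketch of the two-stage recipe: (1) format tuning on COAT demonstrations,
# (2) reinforcement learning from a correctness signal. All names are placeholders.

class ToyModel:
    """Stand-in for a language model; `skill` is a crude proxy for policy quality."""
    def __init__(self) -> None:
        self.knows_format = False
        self.skill = 0.0

    def sample_trace(self, problem: str) -> tuple[str, bool]:
        """Return a (trace, is_correct) pair; success probability grows with skill."""
        correct = random.random() < min(0.9, 0.2 + self.skill)
        trace = f"<|continue|> attempt {problem} <|reflect|> check <|explore|> revise"
        return trace, correct

def synthesize_coat_demos(problems: list[str]) -> list[tuple[str, str]]:
    # Stage-1 data: in the paper this comes from a multi-agent synthesis framework;
    # here it is just a fixed COAT-formatted template per problem.
    template = "<|continue|> solve step by step\n<|reflect|> verify\n<|explore|> try an alternative"
    return [(p, template) for p in problems]

def format_tuning(model: ToyModel, demos: list[tuple[str, str]]) -> ToyModel:
    # Stage 1: supervised fine-tuning so the model internalizes the action-token format.
    model.knows_format = len(demos) > 0
    return model

def reinforcement_learning(model: ToyModel, problems: list[str], epochs: int = 5) -> ToyModel:
    # Stage 2: the model generates its own COAT traces and is rewarded when the final
    # answer is correct, nudging it toward effective reflection and exploration.
    for _ in range(epochs):
        for p in problems:
            _, correct = model.sample_trace(p)
            model.skill += 0.05 if correct else -0.01
    return model

if __name__ == "__main__":
    random.seed(0)
    problems = ["17 * 24", "sum of the first 10 squares"]
    model = format_tuning(ToyModel(), synthesize_coat_demos(problems))
    model = reinforcement_learning(model, problems)
    print(model.knows_format, round(model.skill, 2))
```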