

Teaching LLMs to Self-Reflect with Reinforcement Learning with Maohao Shen - #726
Apr 8, 2025
Maohao Shen, a PhD student at MIT specializing in AI reliability, discusses his groundbreaking work on 'Satori.' He reveals how it enhances language model reasoning through reinforcement learning, enabling self-reflection and exploration. The podcast dives into the innovative Chain-of-Action-Thought approach, which guides models in complex reasoning tasks. Maohao also explains the two-stage training process, including format tuning and self-corrective techniques. The conversation highlights Satori’s impressive performance and its potential to redefine AI reasoning capabilities.
AI Snips
Autoregressive Search
- Autoregressive search in language models aims to mimic human-like reasoning with self-reflection and exploration.
- Unlike traditional test-time search methods, it relies on neither external guidance nor multiple models (see the sketch below).
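To make that contrast concrete, here is a minimal Python sketch of external best-of-N search versus a single-model autoregressive search pass. The functions `policy_generate` and `verifier_score` are hypothetical stand-ins for a language model and an external reward model, not anything from Satori itself.

```python
# Minimal sketch contrasting external test-time search with single-pass
# "autoregressive search". policy_generate and verifier_score are hypothetical
# stand-ins for a language model and an external reward model.

def policy_generate(prompt: str, seed: int = 0) -> str:
    """Placeholder LLM call; returns one candidate solution string."""
    return f"candidate-{seed} for: {prompt}"

def verifier_score(solution: str) -> float:
    """Placeholder external verifier, needed only by the best-of-N baseline."""
    return float(len(solution) % 7)

def best_of_n_search(prompt: str, n: int = 8) -> str:
    # Traditional test-time search: sample N candidates, rank them with a second model.
    candidates = [policy_generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=verifier_score)

def autoregressive_search(prompt: str) -> str:
    # Satori-style idea: one model, one generation pass; reflection and exploration
    # happen inside the generated sequence (via the COAT tokens described next),
    # so no external guidance or second model is needed at inference time.
    return policy_generate(prompt + "\nThink step by step, reflect, and explore.")

if __name__ == "__main__":
    print(best_of_n_search("What is 17 * 24?"))
    print(autoregressive_search("What is 17 * 24?"))
```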
Chain-of-Action-Thought (COAT)
- Chain-of-Action-Thought (COAT) uses special tokens (continue, reflect, explore) to guide the model's reasoning actions.
- This lets the model self-correct, explore alternatives, and simulate human-like problem-solving, unlike traditional Chain-of-Thought; a toy trace is sketched below.
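Here is a toy illustration of what a COAT-formatted trace might look like. Only the three meta-actions (continue, reflect, explore) come from the episode; the exact token strings and the `build_coat_trace` helper are illustrative assumptions, not the paper's implementation.

```python
# Toy COAT-style trace builder. The token surface forms below are assumed for
# illustration; only the three meta-actions come from the discussion above.
CONTINUE, REFLECT, EXPLORE = "<|continue|>", "<|reflect|>", "<|explore|>"

def build_coat_trace(steps: list[tuple[str, str]]) -> str:
    """Interleave meta-action tokens with reasoning text into one training string."""
    token_for = {"continue": CONTINUE, "reflect": REFLECT, "explore": EXPLORE}
    return "\n".join(f"{token_for[action]} {text}" for action, text in steps)

trace = build_coat_trace([
    ("continue", "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 418."),
    ("reflect",  "Check: 340 + 68 is 408, not 418, so the last step is wrong."),
    ("explore",  "Alternative route: 17 * 24 = 17 * 25 - 17 = 425 - 17 = 408."),
])
print(trace)
```

The reflect step is where the model can catch its own mistake, and the explore step retries with a different decomposition, which is the self-corrective behavior the conversation highlights.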
Training with COAT
- Training a language model with COAT involves a two-stage process: format tuning and reinforcement learning.
- Format tuning familiarizes the model with the special action tokens using a multi-agent data synthesis framework that simulates collaborative problem-solving; a toy version of the full pipeline is sketched below.
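Below is a high-level, toy sketch of how the two stages could fit together: supervised format tuning on synthesized COAT demonstrations, then reinforcement learning from a correctness reward. `ToyModel`, the demo synthesizer, and the reward shaping are placeholder assumptions, not the Satori codebase.

```python
import random

# Toy sketch of the two-stage recipe: (1) format tuning on COAT demonstrations,
# (2) reinforcement learning from a correctness signal. All names are placeholders.

class ToyModel:
    """Stand-in for a language model; `skill` is a crude proxy for policy quality."""
    def __init__(self) -> None:
        self.knows_format = False
        self.skill = 0.0

    def sample_trace(self, problem: str) -> tuple[str, bool]:
        """Return a (trace, is_correct) pair; success probability grows with skill."""
        correct = random.random() < min(0.9, 0.2 + self.skill)
        trace = f"<|continue|> attempt {problem} <|reflect|> check <|explore|> revise"
        return trace, correct

def synthesize_coat_demos(problems: list[str]) -> list[tuple[str, str]]:
    # Stage-1 data: in the paper this comes from a multi-agent synthesis framework;
    # here it is just a fixed COAT-formatted template per problem.
    template = "<|continue|> solve step by step\n<|reflect|> verify\n<|explore|> try an alternative"
    return [(p, template) for p in problems]

def format_tuning(model: ToyModel, demos: list[tuple[str, str]]) -> ToyModel:
    # Stage 1: supervised fine-tuning so the model internalizes the action-token format.
    model.knows_format = len(demos) > 0
    return model

def reinforcement_learning(model: ToyModel, problems: list[str], epochs: int = 5) -> ToyModel:
    # Stage 2: the model generates its own COAT traces and is rewarded when the final
    # answer is correct, nudging it toward effective reflection and exploration.
    for _ in range(epochs):
        for p in problems:
            _, correct = model.sample_trace(p)
            model.skill += 0.05 if correct else -0.01
    return model

if __name__ == "__main__":
    random.seed(0)
    problems = ["17 * 24", "sum of the first 10 squares"]
    model = format_tuning(ToyModel(), synthesize_coat_demos(problems))
    model = reinforcement_learning(model, problems)
    print(model.knows_format, round(model.skill, 2))
```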