The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Teaching LLMs to Self-Reflect with Reinforcement Learning with Maohao Shen - #726

Apr 8, 2025
Maohao Shen, a PhD student at MIT specializing in AI reliability, discusses his work on 'Satori.' He explains how it enhances language model reasoning through reinforcement learning, enabling self-reflection and exploration. The conversation dives into the Chain-of-Action-Thought (COAT) approach, which guides models through complex reasoning tasks, and walks through the two-stage training process: format tuning followed by reinforcement learning with self-correction. It also highlights Satori's strong benchmark performance and its potential to advance AI reasoning capabilities.
INSIGHT

Autoregressive Search

  • Autoregressive search in language models aims to mimic human-like reasoning through self-reflection and exploration.
  • Unlike traditional test-time search methods, it doesn't rely on external guidance or multiple models; a single model searches within its own generation stream (a contrast is sketched below).
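
The following is a minimal, self-contained sketch of this distinction. The toy_model and toy_verifier functions are stand-ins invented for illustration, not Satori's code: best_of_n shows external test-time search with a separate verifier, while autoregressive_search keeps reflection inside a single model's generation stream.

```python
# Sketch: external test-time search vs. autoregressive search.
# toy_model and toy_verifier are illustrative stand-ins only.
import random

def toy_model(prompt: str, seed: int) -> str:
    """Stand-in for an LLM sampling one candidate solution."""
    random.seed(seed)
    return f"{prompt} -> answer_{random.randint(0, 9)}"

def toy_verifier(candidate: str) -> float:
    """Stand-in for an external reward model scoring a candidate."""
    return float(candidate.endswith("7"))

def best_of_n(prompt: str, n: int = 8) -> str:
    # External guidance: sample N candidates, rerank with a second model.
    candidates = [toy_model(prompt, seed=i) for i in range(n)]
    return max(candidates, key=toy_verifier)

def autoregressive_search(prompt: str) -> str:
    # Single model, single stream: reflection happens *inside* the
    # generation via special tokens rather than via an outside judge.
    trace = toy_model(prompt, seed=0)
    trace += " <|reflect|> checking the last step... <|continue|> final answer"
    return trace

print(best_of_n("2+5"))
print(autoregressive_search("2+5"))
```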
INSIGHT

Chain-of-Action-Thought (COAT)

  • Chain-of-Action-Thought (COAT) uses special meta-action tokens (continue, reflect, explore) to guide the model's reasoning moves.
  • This allows the model to self-correct, explore alternatives, and simulate human-like problem-solving, unlike traditional Chain of Thought (see the tokenizer sketch below).
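
As a rough illustration of how such action tokens could be wired into a model, here is a hedged sketch using the Hugging Face transformers API. The token strings and the base model name are assumptions for illustration; Satori's exact vocabulary and checkpoint may differ.

```python
# Sketch: registering COAT-style meta-action tokens with a causal LM.
# Token strings and model name are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B"  # assumption: any small causal LM works here

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Register the three meta-actions as single vocabulary items the policy
# can learn to emit: continue reasoning, reflect on the current step,
# or explore an alternative branch.
meta_actions = ["<|continue|>", "<|reflect|>", "<|explore|>"]
tokenizer.add_special_tokens({"additional_special_tokens": meta_actions})
model.resize_token_embeddings(len(tokenizer))

# A COAT trajectory is ordinary text punctuated by meta-action tokens,
# unlike plain Chain of Thought, which has no explicit self-correction moves.
trajectory = (
    "Solve 12 * 13. <|continue|> 12 * 13 = 12 * 10 + 12 * 3 "
    "<|reflect|> 120 + 36 = 156, that checks out. <|continue|> Answer: 156"
)
print(tokenizer.tokenize(trajectory))
```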
ANECDOTE

Training with COAT

  • Training a language model with COAT involves a two-stage process: format tuning followed by reinforcement learning.
  • Format tuning familiarizes the model with the special action tokens using a multi-agent data synthesis framework that simulates collaborative problem-solving (a sketch of the two stages follows below).
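
Below is a high-level sketch of that two-stage recipe. Every function here is a toy stand-in invented for illustration: synthesize_demo mimics the multi-agent data synthesis, format_tune builds a supervised fine-tuning dataset, and rl_tune abstracts the reinforcement learning stage. The actual pipeline (e.g., policy optimization against a correctness-based reward) is far more involved.

```python
# Sketch: two-stage COAT training -- format tuning, then RL.
# All functions are toy stand-ins, not Satori's implementation.

def synthesize_demo(problem: str) -> str:
    """Multi-agent stand-in: an actor drafts steps, a critic injects a
    reflection, and the result is a COAT-formatted demonstration."""
    draft = f"{problem} <|continue|> step 1 <|continue|> step 2"
    critique = " <|reflect|> step 2 looks off <|explore|> retry step 2"
    return draft + critique + " <|continue|> final answer"

def format_tune(problems: list[str]) -> list[str]:
    # Stage 1 (format tuning): supervised fine-tuning on synthesized
    # demonstrations so the model imitates the COAT format -- where the
    # action tokens go, not yet how to use them well.
    return [synthesize_demo(p) for p in problems]  # the SFT dataset

def rl_tune(problems: list[str], reward_fn, epochs: int = 3) -> float:
    # Stage 2 (RL): the model samples its own COAT trajectories and is
    # rewarded for reaching correct final answers, reinforcing
    # productive self-reflection and exploration.
    total_return = 0.0
    for _ in range(epochs):
        for p in problems:
            trajectory = f"{p} <|continue|> ... <|reflect|> ... final answer"
            total_return += reward_fn(trajectory)
    return total_return

problems = ["Solve 12 * 13.", "Solve 7 + 8."]
sft_data = format_tune(problems)
ret = rl_tune(problems, reward_fn=lambda t: 1.0 if "final answer" in t else 0.0)
print(sft_data[0])
print(ret)
```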