AXRP - the AI X-risk Research Podcast

38.5 - Adrià Garriga-Alonso on Detecting AI Scheming

Jan 20, 2025
Adrià Garriga-Alonso, a machine learning researcher at FAR.AI, dives into the fascinating world of AI scheming. He discusses how to detect deceptive behaviors in AI that may conceal long-term plans. The conversation explores the intricacies of training recurrent neural networks for complex tasks like Sokoban, emphasizing the significance of extended thinking time. Garriga-Alonso also sheds light on how neural networks set and prioritize goals, revealing the challenges of interpreting their decision-making processes.
INSIGHT

Long-Term Goals and AI Scheming

  • Adrià focuses on AI scheming toward long-term goals, such as acquiring resources, which is central to dangerously misaligned optimization.
  • He studies progressively less toy-like model organisms to understand how these goals translate into actions.
ANECDOTE

Sokoban-Solving RNNs

  • Adrià trains recurrent neural networks (RNNs) on Sokoban, a puzzle game that requires long-term planning to avoid getting irreversibly stuck.
  • Giving the RNN more "thinking" time at the start of a level improves its performance, much like chain-of-thought prompting in LLMs (see the sketch after this list).
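A minimal sketch of what "extra thinking time" means mechanically: the recurrent core is ticked several times on the first observation, without stepping the environment, before the agent commits to a move. All names here (TinyRecurrentPolicy, act_with_thinking_time, n_think_steps) and the toy GRU architecture are illustrative assumptions, not the episode's actual agent or code.

```python
import torch
import torch.nn as nn


class TinyRecurrentPolicy(nn.Module):
    """Toy GRU core plus a linear policy head over a flattened observation."""

    def __init__(self, obs_dim: int, hidden_dim: int, n_actions: int):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.core = nn.GRUCell(hidden_dim, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, n_actions)

    def step(self, obs: torch.Tensor, h: torch.Tensor):
        """One recurrent tick: update the hidden state, return action logits."""
        h = self.core(torch.relu(self.encoder(obs)), h)
        return self.policy_head(h), h


def act_with_thinking_time(policy: TinyRecurrentPolicy,
                           first_obs: torch.Tensor,
                           n_think_steps: int = 6):
    """Tick the recurrent core several extra times on the initial observation
    (no environment steps are taken) before committing to the first action.
    The extra ticks give the network more sequential compute up front,
    analogous to chain-of-thought tokens in an LLM."""
    h = torch.zeros(first_obs.shape[0], policy.core.hidden_size)
    logits = None
    for _ in range(n_think_steps + 1):  # the final tick is the one that decides
        logits, h = policy.step(first_obs, h)
    action = torch.argmax(logits, dim=-1).item()
    return action, h  # carry h into the rest of the rollout


if __name__ == "__main__":
    policy = TinyRecurrentPolicy(obs_dim=100, hidden_dim=64, n_actions=4)
    dummy_obs = torch.randn(1, 100)  # stand-in for a flattened 10x10 Sokoban grid
    action, _ = act_with_thinking_time(policy, dummy_obs, n_think_steps=6)
    print("first action after thinking:", action)
```

The real agent and training setup differ; the sketch only shows the control flow of running the recurrent core extra times before the first action is required.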
INSIGHT

Emergent Meta-Strategy

  • The Sokoban-playing RNN learned a meta-strategy of "thinking" for longer on complex levels by pacing around before acting.
  • This behavior occurs mostly at the start of a level, suggests internal planning, and can be replaced by giving the network extra processing time (one toy way to quantify the pacing is sketched below).
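A purely illustrative sketch of one way to quantify the pacing behavior described above: count how often the agent's first few moves revisit an already visited square without pushing a box. The window size, the no-box-push criterion, and the pacing_score name are assumptions made for this example, not measurements from the episode.

```python
def pacing_score(positions: list[tuple[int, int]],
                 boxes_pushed: list[bool],
                 window: int = 10) -> float:
    """Fraction of early steps that return to a previously visited square
    without making progress (no box pushed). A high value suggests the agent
    is 'thinking by pacing' before committing to a plan."""
    seen = set()
    pacing_steps = 0
    early = list(zip(positions[:window], boxes_pushed[:window]))
    for pos, pushed in early:
        if pos in seen and not pushed:
            pacing_steps += 1
        seen.add(pos)
    return pacing_steps / max(len(early), 1)


if __name__ == "__main__":
    # Toy trajectory: the agent shuffles between two squares before moving on.
    traj = [(1, 1), (1, 2), (1, 1), (1, 2), (1, 1), (2, 1), (3, 1)]
    pushes = [False] * 5 + [True, True]
    print(f"pacing score: {pacing_score(traj, pushes):.2f}")
```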