

38.5 - Adrià Garriga-Alonso on Detecting AI Scheming
Jan 20, 2025
Adrià Garriga-Alonso, a machine learning researcher at FAR.AI, dives into the fascinating world of AI scheming. He discusses how to detect deceptive behaviors in AI that may conceal long-term plans. The conversation explores the intricacies of training recurrent neural networks for complex tasks like Sokoban, emphasizing the significance of extended thinking time. Garriga-Alonso also sheds light on how neural networks set and prioritize goals, revealing the challenges of interpreting their decision-making processes.
Long-Term Goals and AI Scheming
- Adrià focuses on AI scheming in pursuit of long-term goals, such as acquiring resources, which is central to dangerously misaligned optimization.
- He studies progressively less toy-like model organisms to understand how these goals translate into actions.
Sokoban-Solving RNNs
- Adrià trains recurrent neural networks (RNNs) on Sokoban, a puzzle game requiring long-term planning to avoid getting stuck.
- Giving the RNN more initial "thinking" time improves its performance, similar to chain-of-thought prompting in LLMs.
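The "extra thinking time" idea can be sketched as running a recurrent core for additional ticks on the initial observation before committing to an action. A minimal illustrative stand-in, not the actual trained network from the episode (the sizes, weights, and function names here are invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny untrained recurrent policy core; dimensions are illustrative.
OBS, HID, ACTIONS = 16, 32, 4
W_in = rng.normal(0, 0.1, (HID, OBS))
W_rec = rng.normal(0, 0.1, (HID, HID))
W_out = rng.normal(0, 0.1, (ACTIONS, HID))

def step(h, obs):
    """One recurrent tick: update the hidden state from the observation."""
    return np.tanh(W_in @ obs + W_rec @ h)

def act(obs, thinking_steps=0):
    """Run extra 'thinking' ticks on the first observation before acting.

    With thinking_steps=0 the policy acts immediately; larger values let
    the hidden state settle, analogous to giving the Sokoban RNN more
    time to plan before its first move.
    """
    h = np.zeros(HID)
    for _ in range(1 + thinking_steps):  # always at least one tick
        h = step(h, obs)
    logits = W_out @ h
    return int(np.argmax(logits)), h

obs = rng.normal(size=OBS)
a0, h0 = act(obs, thinking_steps=0)
a8, h8 = act(obs, thinking_steps=8)
# The hidden state keeps evolving across extra ticks even though the
# input is held fixed, so the eventual action can differ.
print(a0, a8, float(np.linalg.norm(h8 - h0)))
```

In the trained networks discussed in the episode, this kind of extra recurrent computation at the start of a level measurably improves puzzle-solving, which is the analogy to chain-of-thought drawn above.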
Emergent Meta-Strategy
- The Sokoban RNN learned a meta-strategy of "thinking" longer on complex levels by pacing around.
- This pacing happens mostly at the start of a level, suggests internal planning, and can be substituted with pure extra processing time before the first move.