38.5 - Adrià Garriga-Alonso on Detecting AI Scheming
Jan 20, 2025
Adrià Garriga-Alonso, a machine learning researcher at FAR.AI, dives into the fascinating world of AI scheming. He discusses how to detect deceptive behaviors in AI that may conceal long-term plans. The conversation explores the intricacies of training recurrent neural networks for complex tasks like Sokoban, emphasizing the significance of extended thinking time. Garriga-Alonso also sheds light on how neural networks set and prioritize goals, revealing the challenges of interpreting their decision-making processes.
Understanding the mechanisms behind AI scheming behavior is crucial for identifying potential misalignment and hidden goals within AI systems.
Training AI models such as recurrent neural networks through gameplay illustrates how additional thinking time improves their problem-solving and decision-making.
Deep dives
Exploring Mechanistic Interpretability
The research focuses on mechanistic interpretability to understand how neural networks develop and pursue goals. It emphasizes the potential dangers when AI systems exhibit scheming behavior, meaning actions an AI takes to achieve its internalized goals while concealing its intentions from developers or users. The work analyzes the conditions under which scheming can manifest, and suggests that AI behavior may be more reactive than previously thought. By identifying how goals and actions are represented within these networks, the research aims to uncover mechanisms that could signal potential misalignment.
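One common way to make "how goals are represented" concrete is to train a linear probe on the network's hidden activations. The sketch below is illustrative only and is not the method described in the episode: the recurrent policy, the goal labels, and all shapes are hypothetical stand-ins for data one would collect from rollouts of a trained agent.

```python
# Illustrative sketch (not the episode's method): train a linear probe to
# predict a labeled "goal" (e.g. which box a Sokoban agent pushes next)
# from a recurrent policy's hidden state. All names and shapes are hypothetical.
import torch
import torch.nn as nn

hidden_dim, num_goals, num_samples = 128, 4, 1024

# Stand-ins for data collected from rollouts of a trained agent:
# hidden states at decision points, and the goal label realised later.
hidden_states = torch.randn(num_samples, hidden_dim)
goal_labels = torch.randint(0, num_goals, (num_samples,))

probe = nn.Linear(hidden_dim, num_goals)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    logits = probe(hidden_states)
    loss = loss_fn(logits, goal_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Accuracy on held-out data (training data reused here only for brevity).
accuracy = (probe(hidden_states).argmax(dim=-1) == goal_labels).float().mean()
print(f"probe accuracy: {accuracy.item():.2f}")
```

High held-out accuracy would be evidence that a goal is linearly decodable from the hidden state before the corresponding actions are taken; it would not by itself establish that the network actually uses that representation.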
Training Game-Playing Agents
Training models such as recurrent neural networks through gameplay, specifically in the puzzle game Sokoban, has provided insights into reinforcement learning and AI behavior. The work replicates a setup from previous research, emphasizing the effect of providing more time for the agent to think before executing actions. Observations indicated that when given additional thinking time, the AI demonstrated improved problem-solving capabilities, successfully completing 5% more levels. Additionally, behaviors like pacing around at the beginning of the level suggest the neural network has learned to give itself time to plan, showcasing the intricate interplay between decision-making and action execution.
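As a rough illustration of what extra thinking time can look like for a recurrent agent, the sketch below repeats the recurrent update on the same observation several times before an action is chosen. The tiny GRU policy, the observation encoding, and the fixed number of thinking steps are all assumptions made for the example; the agent discussed in the episode is a recurrent network trained with reinforcement learning on Sokoban, not this toy model.

```python
# Illustrative sketch of "extra thinking time" for a recurrent policy:
# before committing to an action, run the recurrent core several times on
# the same observation so its hidden state can keep refining a plan.
# The policy class and observation encoding are hypothetical placeholders.
import torch
import torch.nn as nn

class TinyRecurrentPolicy(nn.Module):
    def __init__(self, obs_dim=64, hidden_dim=128, num_actions=4):
        super().__init__()
        self.core = nn.GRUCell(obs_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs, h):
        h = self.core(obs, h)
        return self.head(h), h

def act_with_thinking(policy, obs, h, thinking_steps=5):
    # Repeat the recurrent update on the same observation; only the final
    # hidden state is used to pick an action (a no-op "pause to plan").
    for _ in range(thinking_steps):
        _, h = policy(obs, h)
    logits, h = policy(obs, h)
    return logits.argmax(dim=-1), h

policy = TinyRecurrentPolicy()
obs = torch.randn(1, 64)          # stand-in for an encoded Sokoban frame
h = torch.zeros(1, 128)
action, h = act_with_thinking(policy, obs, h, thinking_steps=5)
print(action)
```

Giving the core more recurrent updates per environment step is one simple way to trade compute for planning; the pacing behavior mentioned above is behaviorally similar, in that it buys the network extra recurrent updates before it commits to a push.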
Insights on Planning and Goals in AI
The discussion of scheming behavior led to a more nuanced understanding of planning in AI systems, highlighting the complexity of goal-directed action. Long-term goals may contribute significantly to the dangers posed by AI, which calls for closer examination of how such goals are represented and pursued. Examining behavior across different planning architectures suggests that some forms of scheming may not require extensive deliberation or foresight, complicating the assessment of AI alignment. This line of inquiry ultimately aims to use smaller models as a stepping stone toward understanding more sophisticated systems, providing insight into their operational dynamics and potential risks.
Suppose we're worried about AIs engaging in long-term plans that they don't tell us about. If we were to peek inside their brains, what should we look for to check whether this was happening? In this episode Adrià Garriga-Alonso talks about his work trying to answer this question.