Scaling Test Time Compute to Multi-Agent Civilizations — Noam Brown, OpenAI

Jun 19, 2025

Noam Brown, who leads the multi-agent team at OpenAI, discusses his work on AI for competitive strategy games such as poker and Diplomacy. He describes how AI has changed human gameplay and critiques the System 1/2 analogy for AI reasoning. The conversation also covers the limits of test-time compute, multi-agent intelligence, and coding tools such as Codex and Windsurf, along with speculation about the future of AI civilizations.

Noam’s Diplomacy Journey

Noam Brown improved his Diplomacy gameplay by deeply studying the game and learning from his AI bot Cicero's unique moves.
This process helped him win the 2025 World Diplomacy Championship several years after releasing Cicero.

LLM Advances Enable Realistic Bots

Early Diplomacy bots struggled with language quality, which led to hallucinations and inconsistencies.
Modern large language models now pass the Turing test, making bots much harder to distinguish from humans.

System 2 Needs Strong System 1

The benefits of reasoning (System 2) emerge only after models reach sufficient System 1 capability.
Early, smaller models gained little from chain-of-thought prompting compared with larger models.

Solving Poker and Diplomacy, Debating RL+Reasoning with Ilya, what's *wrong* with the System 1/2 analogy, and where Test-Time Compute hits a wall

Timestamps

00:00 Intro – Diplomacy, Cicero & World Championship
02:00 Reverse Centaur: How AI Improved Noam’s Human Play
05:00 Turing Test Failures in Chat: Hallucinations & Steerability
07:30 Reasoning Models & Fast vs. Slow Thinking Paradigm
11:00 System 1 vs. System 2 in Visual Tasks (GeoGuessr, Tic-Tac-Toe)
14:00 The Deep Research Existence Proof for Unverifiable Domains
17:30 Harnesses, Tool Use, and Fragility in AI Agents
21:00 The Case Against Over-Reliance on Scaffolds and Routers
24:00 Reinforcement Fine-Tuning and Long-Term Model Adaptability
28:00 Ilya’s Bet on Reasoning and the O-Series Breakthrough
34:00 Noam’s Dev Stack: Codex, Windsurf & AGI Moments
38:00 Building Better AI Developers: Memory, Reuse, and PR Reviews
41:00 Multi-Agent Intelligence and the “AI Civilization” Hypothesis
44:30 Implicit World Models and Theory of Mind Through Scaling
48:00 Why Self-Play Breaks Down Beyond Go and Chess
54:00 Designing Better Benchmarks for Fuzzy Tasks
57:30 The Real Limits of Test-Time Compute: Cost vs. Time
1:00:30 Data Efficiency Gaps Between Humans and LLMs
1:03:00 Training Pipeline: Pretraining, Midtraining, Posttraining
1:05:00 Games as Research Proving Grounds: Poker, MTG, Stratego
1:10:00 Closing Thoughts – Five-Year View and Open Research Directions

Chapters

00:00:00 Intro & Guest Welcome
00:00:33 Diplomacy AI & Cicero Insights
00:03:49 AI Safety, Language Models, and Steerability
00:05:23 O Series Models: Progress and Benchmarks
00:08:53 Reasoning Paradigm: Thinking Fast and Slow in AI
00:14:02 Design Questions: Harnesses, Tools, and Test Time Compute
00:20:32 Reinforcement Fine-tuning & Model Specialization
00:21:52 The Rise of Reasoning Models at OpenAI
00:29:33 Data Efficiency in Machine Learning
00:33:21 Coding & AI: Codex, Workflows, and Developer Experience
00:41:38 Multi-Agent AI: Collaboration, Competition, and Civilization
00:45:14 Poker, Diplomacy & Exploitative vs. Optimal AI Strategy
00:52:11 World Models, Multi-Agent Learning, and Self-Play
00:58:50 Generative Media: Image & Video Models
01:00:44 Robotics: Humanoids, Iteration Speed, and Embodiment
01:04:25 Rapid Fire: Research Practices, Benchmarks, and AI Progress
01:14:19 Games, Imperfect Information, and AI Research Directions