Solving poker and Diplomacy, debating RL+reasoning with Ilya, what's *wrong* with the System 1/2 analogy, and where test-time compute hits a wall
Timestamps
00:00 Intro – Diplomacy, Cicero & World Championship 
02:00 Reverse Centaur: How AI Improved Noam’s Human Play 
05:00 Turing Test Failures in Chat: Hallucinations & Steerability 
07:30 Reasoning Models & Fast vs. Slow Thinking Paradigm 
11:00 System 1 vs. System 2 in Visual Tasks (GeoGuessr, Tic-Tac-Toe) 
14:00 The Deep Research Existence Proof for Unverifiable Domains 
17:30 Harnesses, Tool Use, and Fragility in AI Agents 
21:00 The Case Against Over-Reliance on Scaffolds and Routers 
24:00 Reinforcement Fine-Tuning and Long-Term Model Adaptability 
28:00 Ilya’s Bet on Reasoning and the o-Series Breakthrough
34:00 Noam’s Dev Stack: Codex, Windsurf & AGI Moments 
38:00 Building Better AI Developers: Memory, Reuse, and PR Reviews 
41:00 Multi-Agent Intelligence and the “AI Civilization” Hypothesis 
44:30 Implicit World Models and Theory of Mind Through Scaling 
48:00 Why Self-Play Breaks Down Beyond Go and Chess 
54:00 Designing Better Benchmarks for Fuzzy Tasks 
57:30 The Real Limits of Test-Time Compute: Cost vs. Time 
1:00:30 Data Efficiency Gaps Between Humans and LLMs 
1:03:00 Training Pipeline: Pretraining, Midtraining, Posttraining 
1:05:00 Games as Research Proving Grounds: Poker, MTG, Stratego 
1:10:00 Closing Thoughts – Five-Year View and Open Research Directions