Latent Space: The AI Engineer Podcast

[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang

Dec 31, 2025
Join John Yang, a Stanford PhD student and the mind behind SWE-bench and CodeClash, as he shares insights from the cutting-edge world of AI coding benchmarks. Discover how SWE-bench went from zero to industry standard in mere months, the limitations of traditional unit tests, and the innovative long-horizon tournaments of CodeClash. Yang dives into the debate around Tau-bench's 'impossible tasks' and explores the balance between autonomous agents and interactive workflows. Get ready for a glimpse into the future of human-AI collaboration!

SWE-bench's Surprise Breakout

  • John Yang recounts SWE-bench's slow start after its October 2023 release and the sudden boost after Cognition's Devin launch.
  • He describes receiving an excited email from Walden praising the benchmark two weeks before Devin's public release.

Benchmarks Need Thoughtful Curation

  • SWE-bench evolved into many variants including multimodal and multilingual splits to avoid Django-heavy bias.
  • John expects future splits to justify difficulty with curation techniques rather than just more repos.

Evaluate Long-Horizon Development

  • Move beyond single-shot unit-test verification when evaluating coding agents and embrace long-horizon development scenarios.
  • John proposes CodeClash: agents maintain codebases, iterate, and compete in arenas over multiple rounds for richer evaluation.