Justified Posteriors

Are We There Yet? Evaluating METR’s Eval of AI’s Ability to Complete Tasks of Different Lengths

Dec 15, 2025
Seth and Andrey dig into the implications of METR's paper on AI's ability to complete tasks of varying lengths. They discuss the striking claim that the length of task AI can handle doubles roughly every 7 months. The hosts debate whether task length or economic value is the better measure of AI capability. They also explore the challenges of long tasks, questioning whether complex projects can truly be broken down into simpler subtasks. Real-world examples, like AI agents playing Pokémon, highlight AI's ongoing struggles with messy tasks.
INSIGHT

Length-Based Capability Horizon

  • METR measures AI capability by the maximum task length models can complete autonomously, rather than by the economic value of those tasks.
  • They convert model performance into a 50%-completion time horizon and plot that horizon across frontier models over time.
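The "50% horizon" idea can be sketched numerically: fit a logistic curve of success probability against log task length, then solve for the length at which predicted success falls to 50%. The data and fitting routine below are illustrative assumptions, not METR's actual dataset or implementation.

```python
import math

# Fabricated example data: (task length in minutes, model succeeded?)
results = [(1, 1), (2, 1), (4, 1), (8, 1), (15, 0), (15, 1),
           (30, 0), (30, 1), (60, 0), (120, 0)]

def fit_logistic(data, steps=5000, lr=0.1):
    """Crude gradient-ascent logistic regression: p = sigmoid(a + b*log2(length))."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for length, y in data:
            x = math.log2(length)
            p = 1 / (1 + math.exp(-(a + b * x)))
            grad_a += (y - p)          # log-likelihood gradient wrt intercept
            grad_b += (y - p) * x      # ... wrt slope
        a += lr * grad_a / len(data)
        b += lr * grad_b / len(data)
    return a, b

a, b = fit_logistic(results)
# 50% horizon: the length L where a + b*log2(L) = 0, i.e. L = 2**(-a/b)
horizon_minutes = 2 ** (-a / b)
```

With data like this, the fitted slope is negative (success falls off with length) and the horizon lands between the all-success short tasks and the all-failure long ones.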
INSIGHT

Seven-Month Doubling Claim

  • METR reports a roughly 7-month doubling time for the effective horizon (the task-length frontier) across evaluated models.
  • Taken literally, that steep trend extrapolates dramatically, all the way out to month-long autonomous tasks.
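The extrapolation behind this snip is simple exponential arithmetic. A minimal sketch, assuming a 7-month doubling time and a hypothetical 60-minute starting horizon (the starting value is an assumption, not a figure from the episode):

```python
DOUBLING_MONTHS = 7.0

def horizon_after(months_elapsed, start_horizon_minutes):
    """Horizon after `months_elapsed` months, doubling every 7 months."""
    return start_horizon_minutes * 2 ** (months_elapsed / DOUBLING_MONTHS)

# Assume a 50%-success horizon of 60 minutes today:
for years in (1, 2, 3, 4):
    hours = horizon_after(12 * years, 60) / 60
    print(f"after {years} yr: {hours:.1f} hours")
```

Under these assumptions the horizon grows more than tenfold in two years, which is why taking the trend literally quickly produces week- and month-scale tasks.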
INSIGHT

Software Engineering Focus Matters

  • METR builds its human-time baselines from three datasets, spanning short atomic actions up to RE-Bench's eight-hour tasks.
  • The paper glosses over the fact that its task mix is overwhelmingly software-engineering focused, which biases its conclusions.